

slide-1
SLIDE 1

HermitCore – A Unikernel for Extreme Scale Computing

Stefan Lankes1, Simon Pickartz1, Jens Breitbart2

1RWTH Aachen University, Germany 2Technische Universität München, Germany

slide-2
SLIDE 2

Agenda

Motivation · OS Architectures · HermitCore Design · Performance Evaluation · Conclusion and Outlook

2 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016


slide-4
SLIDE 4

Motivation

Yet another multi-kernel approach, with nearly the same motivation as Balazs Gerofi et al.1 The complexity of high-end HPC systems keeps growing

Extreme degree of parallelism · Heterogeneous core architectures · Deep memory hierarchy · Power constraints

⇒ Need for scalable, reliable performance and capability to rapidly adapt to new HW

Applications have also become complex

In-situ analysis, workflows · Sophisticated monitoring and tool support, etc. · Isolated, consistent simulation performance

⇒ Dependence on POSIX and the rich Linux APIs, MPI and OpenMP

Seemingly contradictory requirements. . .

  • 1B. Gerofi et al. “Exploring the Design Space of Combining Linux with Lightweight Kernels for Extreme

Scale Computing”. In: 5th Int. Workshop on Runtime and Operating Systems for Supercomputers. 2015.

3 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-5
SLIDE 5

OS Architectures Light-weight and / or Multi-Kernels for HPC

mOS, McKernel, Catamount, ZeptoOS, FusedOS, L4, FFMK, Hobbes, Kitten, CNK, . . . A detailed analysis follows in the next talk2

Unikernels / LibraryOS

Basic ideas already developed in the Exokernel Era

Each process has its own hardware abstraction layer

Regained relevance in the area of cloud computing (e. g., IncludeOS, MirageOS)

With Qemu / KVM the abstraction layer is already defined

HermitCore is a combination of a multi-kernel and a unikernel

  • 2B. Gerofi et al. “A Multi-Kernel Survey for High-Performance Computing”.

In: 6th Int. Workshop on Runtime and Operating Systems for Supercomputers. 2016.

4 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-6
SLIDE 6

OS Designs for Cloud Computing – LibraryOS

[Figure: hypervisor with a software virtual switch (eth0) hosting libOS applications, each with its own eth0 interface]

Now, every system call is a function call ⇒ Low overhead

5 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-7
SLIDE 7

HermitCore – Basic ideas

Combination of the Unikernel and Multi-Kernel to reduce the overhead

Support of bare-metal execution Unikernel ⇒ system calls are realized as function calls

Single-address-space operating system ⇒ no TLB shootdown System software should be designed for the hardware

Hierarchical approach (like the hardware)

One kernel per NUMA node

Only local memory accesses (UMA) Message passing between NUMA nodes

Support of dominant programming models (MPI, OpenMP) One full-weight kernel (FWK, here Linux) in the system to get access to broader driver support

Linux is only a fallback for pre-/post-processing; the critical path should be handled by HermitCore

Most system calls handled by HermitCore

  • E. g., memory allocation, access to the network interface

6 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-8
SLIDE 8

Booting HermitCore

[Figure: Hardware · Linux kernel · libc · Proxy · libOS (LwIP, IRQ, etc.) · Newlib · OpenMP / MPI · App]

  • 1. On detection of a HermitCore app, a proxy is started.
  • 2. The proxy unplugs a set of cores.
  • 3. It triggers Linux to boot HermitCore on the unplugged cores.
  • 4. A reliable connection between the proxy and HermitCore is established.
  • 5. On termination, the cores are set to the HALT state.
  • 6. Finally, the cores are re-registered with Linux.

7 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-14
SLIDE 14

HermitCore’s Toolchain (I) Memory Layout

[Memory layout, top to bottom: .bss (uninitialized data) · thread-local storage / per-core storage · .data / .text (application code + data) · .kdata / .ktext (kernel code + data) · .boot (kernel initialization)] libOS

Basic OS services (e.g., interrupt handling) are separated into a library, linked to a normal application like the C library A fixed address for the init code is required

Defined in the linker script Part of HermitCore’s cross toolchain

GCC 5.3.0 & Binutils Support of C / C++ & Fortran

No changes to the common build process
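The fixed init address is pinned in the linker script; a hypothetical fragment illustrating the idea (the address and section order are invented for illustration, not taken from HermitCore's actual script):

```
SECTIONS
{
  /* the kernel init code must sit at a known address so that
     Linux can start the booted cores exactly there */
  . = 0x800000;
  .boot  : { *(.boot)  }
  .ktext : { *(.ktext) }
  .kdata : { *(.kdata) }
  .text  : { *(.text)  }
  .data  : { *(.data)  }
  .bss   : { *(.bss)   }
}
```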

8 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-15
SLIDE 15

HermitCore’s Toolchain (II) Memory Layout


Transparent loading of HermitCore apps via the definition of a new ELF ABI

Only the OS/ABI identification in the ELF header has been changed; minor modifications to GCC & binutils

Via Linux's support for miscellaneous binary formats (binfmt_misc), the loader checks the OS/ABI field

  • 1. Detection of the magic number
  • 2. Starting the proxy
  • 3. The proxy initiates the boot process of the HermitCore app via sysfs

No changes to the common build process

9 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-16
SLIDE 16

Runtime Support

SSE, AVX, FMA,. . . Full C-library support (newlib) IP interface & BSD sockets (LwIP)

IP packets are forwarded to Linux Shared memory interface

Pthreads

Thread binding at start time No load balancing ⇒ less housekeeping

OpenMP, iRCCE & MPI (via SCC-MPICH)

[Figure: Intel SCC — 6×4 mesh of tiles, each tile with two cores, L2 caches, MIU, an MPB and a router (R); cores 0–47, memory controllers MC 0–MC 3, FPGA]

10 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-17
SLIDE 17

OpenMP Runtime

GCC includes an OpenMP runtime (libgomp)

It reuses the synchronization primitives of the Pthread library; other OpenMP runtimes scale better. In addition, our Pthread library was originally not designed for HPC

Integration of Intel’s OpenMP Runtime

It includes its own synchronization primitives and is binary-compatible with GCC's OpenMP runtime. The changes for HermitCore support are small

Mostly the deactivation of functions that define the thread affinity

Transparent usage

For the end-user, no changes in the build process

11 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-18
SLIDE 18

Support of compilers besides GCC

Just avoid the standard environment (-ffreestanding), set the include path to HermitCore's toolchain, and make sure the ELF file uses HermitCore's ABI

Patching object files via elfedit

Use GCC to link the binary

LD      = x86_64-hermit-gcc
#CC     = x86_64-hermit-gcc
#CFLAGS = -O3 -mtune=native -march=native -fopenmp -mno-red-zone
CC      = icc -D__hermit__
CFLAGS  = -O3 -xHost -mno-red-zone -ffreestanding -I$(HERMIT_DIR) -openmp
ELFEDIT = x86_64-hermit-elfedit

stream.o: stream.c
	$(CC) $(CFLAGS) -c -o $@ $<
	$(ELFEDIT) --output-osabi HermitCore $@

stream: stream.o
	$(LD) -o $@ $< $(LDFLAGS) $(CFLAGS)

12 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-19
SLIDE 19

Operating System Micro-Benchmarks

Test system

Intel Haswell CPUs (E5-2650 v3) clocked at 2.3 GHz; 64 GiB DDR4 RAM and 25 MB L3 cache; SpeedStep Technology and TurboMode deactivated; Linux kernel 4.2.5 on Fedora 23 (Workstation Edition); gcc 5.3.x, AVX & FMA support enabled (-mtune=native)

Results in CPU cycles:

System activity               HermitCore   Linux
getpid()                              14     143
sched_yield()                         97     370
write()                             3520    1079
malloc()                            3772    6575
first write access to a page        2014    4007

13 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-20
SLIDE 20

Hourglass Benchmark

The benchmark permanently reads the time stamp counter. (Larger) gaps ⇒ the OS takes computation time (e.g., for housekeeping, device drivers). Results in CPU cycles:

OS                         Gaps Avg     Max
Linux                            69   31068
HermitCore (w/ LwIP)             68   12688
HermitCore (w/o LwIP)            68     376

14 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-21
SLIDE 21

[Figure: histograms of loop time (cycles) vs. number of events for Linux, HermitCore w/ LwIP, and HermitCore w/o LwIP]

slide-22
SLIDE 22

EPCC OpenMP Micro-Benchmarks

[Figure: overhead in µs vs. number of threads (1–10) — Parallel and Parallel For on Linux (gcc), Linux (icc), and HermitCore]

16 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-23
SLIDE 23

Throughput Results of the Inter-kernel Communication Layer

[Figure: throughput in GiB/s vs. message size (256 B – 1 MiB) — PingPong via iRCCE, via SCC-MPICH, and via ParaStation MPI]

17 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-24
SLIDE 24

Outlook

Fast, direct access to the interconnect is required; SR-IOV simplifies the coordination between Linux & HermitCore

[Figure: Linux on node 0 owns the physical NIC; HermitCore instances on nodes 1–3 use SR-IOV vNICs; all nodes are connected by a virtual IP device / message-passing interface]

18 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-25
SLIDE 25

Conclusions

The prototype works; nearly no OS noise; first performance results are promising. Suitable for real-time computing? Try it out! http://www.hermitcore.org Thank you for your kind attention!

19 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-26
SLIDE 26

Backup slides

slide-27
SLIDE 27

Lack of programmability Non-Uniform Memory Access

The costs of memory accesses may vary: run processes where their memory is allocated, or allocate memory where the process resides. Implications for the performance:

Where should the applications store the data? Who should decide the location?

The operating system? The application developers?

[Figure: Socket 0 and Socket 1, each with local memory (Memory 0 / Memory 1), connected via an interconnect]

21 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-28
SLIDE 28

Lack of programmability Non-Uniform Memory Access


[Figure: four NUMA nodes (Memory 0–Memory 3), connected via an interconnect]

21 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-29
SLIDE 29

Tuning Tricks

Parallelization via Shared Memory (OpenMP)

Many side-effects and error-prone Incremental parallelization

Parallelization via Message Passing (MPI)

Requires restructuring of the sequential code, but has fewer side-effects

Performance Tuning

Bind MPI processes to one NUMA node ⇒ no remote memory accesses

[Figure: speed in GFlop/s vs. ⟨threadcount⟩ × ⟨proccount⟩ (2×8, 4×4, 8×2, 16×1) — LU-MZ.C.16, BT-MZ.C.16, SP-MZ.C.16]

22 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-30
SLIDE 30

OS Designs for Cloud Computing – Usage of Common OS

[Figure: hypervisor with a software virtual switch (eth0) hosting full OS instances, each with its own eth0 and one application]

Two operating systems to maintain one computer? Double Management!

23 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-31
SLIDE 31

OS Designs for Cloud Computing – Container

[Figure: one OS (eth0) hosting containers, each running an application]

Builds virtual borders (namespaces); containers and their processes don't see each other. Fast access to OS services, but less secure because an exploit of a container also attacks the host OS

24 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-32
SLIDE 32

Comparison to related Unikernels

Rump kernels3

Part of NetBSD ⇒ e.g., NetBSD's TCP/IP stack is available as a library. Strong dependencies on the hypervisor; not directly bootable on a standard hypervisor (e.g., KVM)

IncludeOS4

Runs natively on the hardware ⇒ minimal overhead. Neither 64-bit nor SMP support

MirageOS5

Designed for the high-level language OCaml ⇒ uncommon in HPC

  • 3A. Kantee and J. Cormack. “Rump Kernels – No OS? No Problem!”

In: ;login: 2014.

  • 4A. Bratterud et al. “IncludeOS: A Resource Efficient Unikernel for Cloud Services”.

In: 7th Int. Conference on Cloud Computing Technology and Science. 2015.

  • 5A. Madhavapeddy et al. “Unikernels: Library Operating Systems for the Cloud”.

In: 8th Int. Conference on Architectural Support for Programming Languages and Operating Systems. 2013.

25 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-33
SLIDE 33

Is the software stack difficult to maintain?

Changes to the common software stack, determined with cloc:

Software Stack          LoC   Changes
binutils          5 121 217       226
gcc               6 850 382     4 821
Linux            15 276 013     1 296
Newlib            1 040 826     5 472
LwIP                 38 883       832
Pthread              13 768       466
OpenMP RT            61 594       324
HermitCore                –    10 597

26 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-34
SLIDE 34

Hourglass Benchmark

The benchmark permanently reads the time stamp counter. (Larger) gaps ⇒ the OS takes computation time (e.g., for housekeeping, device drivers). Results in CPU cycles:

OS                         Gaps Avg     Max
Linux                            69   31068
Linux (isolcpu)                  69   51840
HermitCore (w/ LwIP)             68   12688
HermitCore (w/o LwIP)            68     376

27 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-35
SLIDE 35

[Figure: histograms of loop time (cycles) vs. number of events for Linux, Linux (isolcpu), HermitCore w/ LwIP, and HermitCore w/o LwIP]

slide-36
SLIDE 36

Hydro (preliminary results)

[Figure: MFLOPS vs. number of cores (5–20) — Linux (1 process × n threads), Linux (1 proc. × n thr., bind-to 0–19), Linux (n proc. × 5 thr., bind-to numa), HermitCore (n proc. × 5 thr.)]

29 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-37
SLIDE 37

Which kind of security do we need?

Unikernels ⇒ no system calls ⇒ insecure? In HPC, security could be realized by a cluster management tool. Could Intel's MPX (Memory Protection Extensions) protect the kernel from uncontrolled access?

Part of the Skylake architecture, MPX introduces new bounds registers to protect the system against buffer overflows; the kernel could be the lower bound of a buffer. . .

A (bare-metal) hypervisor solves the problem completely

30 HermitCore | Stefan Lankes et al. | RWTH Aachen University | 1st June 2016

slide-38
SLIDE 38

Thank you for your kind attention!
Stefan Lankes et al. – slankes@eonerc.rwth-aachen.de
Institute for Automation of Complex Power Systems
E.ON Energy Research Center, RWTH Aachen University
Mathieustraße 10, 52074 Aachen
www.acs.eonerc.rwth-aachen.de