Case for Transforming Parallel Run-times Into Operating System Kernels
Paper Reading Group
Paper: Kyle Hale, Peter Dinda
Presents: Maksym Planeta, 18.02.2016
Table of Contents
Introduction Evaluation Development effort Conclusion
What is this project?
- 1. Northwestern University, Sandia Labs, Oak Ridge
- 2. Part of the Hobbes Project
- 3. They also develop Palacios
Why is it interesting for us?
◮ Proposes a microkernel
◮ Uses hyperthreads in an HPC context
◮ Targets Xeon Phi
◮ Cites the L4 paper:
[40] J. Liedtke. On micro-kernel construction. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), pages 237–250, Dec. 1995.
Idea
- 1. HPC app runs in user mode
- 2. Hardware is available in kernel mode
- 3. When an HPC program runs in kernel mode:
3.1 All nice features are directly available
A typical dialog with the kernel

Are the provided kernel abstractions the right ones? Not always:
◮ Runtime: "I'd like to pin memory to a specific PFN range, please." General OS: "NO!"
◮ Runtime: "I'd like to never be interrupted, please." General OS: "NOPE."

Restricted access to hardware:
◮ Runtime: "I'd like to set up some custom page mappings, please." General OS: "Uh, no."
◮ Runtime: "I'd like to interrupt another processor, please." General OS: "HA!"
Motivation

- 1. HPC app runs in user mode
- 2. Hardware is available in kernel mode
- 3. When an HPC program runs in kernel mode:
3.1 All nice features are directly available.
3.2 The kernel does not restrict the program with bad abstractions: for example, the run-time might need subset barriers, and be forced to build them out of mutexes.
3.3 The kernel may waste resources on features the application doesn't need: for example, the run-time might not require coherence, but get it anyway.
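To make point 3.2 concrete, here is a minimal sketch of what the paper's example implies: a "subset barrier" (a barrier over only some of the threads) assembled from the generic mutex/condvar primitives a general-purpose kernel exposes. The structure and names are my illustration, not code from the paper or from Nautilus.

```c
#include <pthread.h>

/* A barrier over an arbitrary subset of threads, built from
 * POSIX mutex + condvar because that is all a general OS offers. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int count;      /* threads still to arrive in this round */
    int size;       /* number of participating threads */
    int generation; /* round counter, guards against spurious wakeups */
} subset_barrier_t;

void subset_barrier_init(subset_barrier_t *b, int size) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->count = size;
    b->size = size;
    b->generation = 0;
}

void subset_barrier_wait(subset_barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int gen = b->generation;
    if (--b->count == 0) {          /* last arrival: release everyone */
        b->generation++;
        b->count = b->size;
        pthread_cond_broadcast(&b->cv);
    } else {
        while (gen == b->generation)
            pthread_cond_wait(&b->cv, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

/* Small demo: NTHREADS workers meet at the barrier, then each
 * increments a counter; returns how many got past the barrier. */
#define NTHREADS 4
static subset_barrier_t bar;
static int after_barrier = 0;

static void *worker(void *arg) {
    (void)arg;
    subset_barrier_wait(&bar);
    __sync_fetch_and_add(&after_barrier, 1);
    return NULL;
}

int run_demo(void) {
    pthread_t t[NTHREADS];
    subset_barrier_init(&bar, NTHREADS);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return after_barrier;
}
```

In an HRT, the runtime could instead implement the barrier directly (e.g., with hardware events or MWAIT), with no syscall boundary in the way.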
Contributions
- 1. Criticize the traditional OS architecture
- 2. Propose a new OS structure
- 3. Port some existing runtimes
Hybrid Run-Time (HRT)
[Figure: (a) Current model: the Parallel App and Parallel Run-time in user mode, the General Kernel in kernel mode, on node HW. (b) Hybrid Run-time model: the Parallel App and the Hybrid Run-time (HRT) together in kernel mode on node HW, forming the performance path.]
What is HRT?
◮ The runtime is the kernel, built within a kernel framework
◮ Everything is kernel space
◮ HRT has full access to the hardware
◮ HRT can pick its own abstractions
Aerokernel
[Figure 2: Structure of Nautilus. The parallel application and the HRT runtime run entirely in kernel mode (no user-mode component) on top of the Nautilus Aerokernel, which provides paging, threads, bootstrap, timers, IRQs, console, topology, synchronization, and kernel events over the hardware.]
Benefits
◮ Better abstractions ◮ Noiseless ◮ Lightweight
Legacy support
[Figure: (c) Hybrid Run-time model within a Hybrid Virtual Machine (HVM). One Parallel App runs in user mode on a general kernel under the general virtualization model (the legacy path); another Parallel App runs on the Hybrid Run-time (HRT) in kernel mode under the specialized virtualization model (the performance path).]
Table of Contents
Introduction Evaluation Development effort Conclusion
Thread creation
[Figure: three panels, (a)-(c), each plotting cycles (1x10^6 to 7x10^6) against thread count (2 to 64) for Nautilus and Linux.]
Figure: Average, minimum, and maximum time to create a number of threads in sequence.
Thread creation
[Figure: three panels, (d)-(f), each plotting speedup against thread count (2 to 64).]

Figure: Speedup of Nautilus over Linux, derived from the previous figure.
Why the bends? Panel (d) bends at 8 threads, (e) at 32, and (f) at 8. Bugs?
Thread creation (summary)
OS        Avg    Min    Max
Nautilus  16795  2907   44264
Linux     38456  34447  238866

Figure 3: Time to create a single thread, measured in cycles.
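The table above reports cycles per thread creation. As a rough sketch of how such a number is obtained (the paper's actual harness is not shown, so the details here are assumptions), one can bracket `pthread_create` with `rdtsc` reads on x86:

```c
#include <pthread.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on x86 */

static void *noop(void *arg) { (void)arg; return NULL; }

/* Cycles spent inside one pthread_create call. The join is kept
 * outside the timed region, matching a creation-cost measurement. */
uint64_t time_thread_create(void) {
    pthread_t t;
    uint64_t start = __rdtsc();
    pthread_create(&t, NULL, noop, NULL);
    uint64_t end = __rdtsc();
    pthread_join(t, NULL);
    return end - start;
}
```

Averaging many such samples (and recording the min and max) would yield a table of the same shape as Figure 3.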
Spinlock microbenchmark
OS        Execution time (s)
Nautilus  13.72
Linux     12.53

OS        Avg. acquire/release time (cycles)
Nautilus  59
Linux     36

Figure 5: Total time to acquire and release a spinlock 500 million times on Nautilus and Linux, and average time in cycles for an acquire/release pair.
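For reference, a spinlock acquire/release microbenchmark of this kind can be sketched with C11 atomics; this is my illustration of the general technique, not the lock implementation from Nautilus or Linux:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simple test-and-set spinlock built on C11 atomic_flag. */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

static inline void spin_lock(atomic_flag *l) {
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* busy-wait until the flag was observed clear */
}

static inline void spin_unlock(atomic_flag *l) {
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Run n acquire/release pairs in a tight loop (as in Figure 5,
 * where n = 500 million); returns the iterations completed. */
uint64_t spin_bench(uint64_t n) {
    uint64_t done = 0;
    for (uint64_t i = 0; i < n; i++) {
        spin_lock(&lock_flag);
        done++;               /* trivial critical section */
        spin_unlock(&lock_flag);
    }
    return done;
}
```

Timing the whole loop and dividing by n gives the per-pair cycle count in the table.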
Wake-up microbenchmark
[Figure 6: Average event wakeup latency in cycles (roughly 5000 to 30000) for Linux, Nautilus with MWAIT, Nautilus with a condition variable, and Nautilus with kick. Annotations: MWAIT is not available in userspace; the kick's overhead is too high in userspace.]
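The condvar bar in Figure 6 measures the classic sleep/signal wakeup path. A minimal sketch of that path with POSIX primitives (my names and structure, not the paper's harness; a real benchmark would take timestamps at the signal and at the wakeup):

```c
#include <pthread.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int ready = 0;   /* the event the sleeper waits for */
static int woke  = 0;   /* set by the sleeper after wakeup */

static void *sleeper(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    while (!ready)                  /* sleep until signalled */
        pthread_cond_wait(&cv, &m);
    woke = 1;                       /* a benchmark would rdtsc here */
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Spawn the sleeper, signal it, and report whether it woke. */
int wakeup_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, sleeper, NULL);
    pthread_mutex_lock(&m);
    ready = 1;                      /* ...and rdtsc here, too */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return woke;
}
```

The MWAIT variant replaces the sleep with the MONITOR/MWAIT instruction pair, which is privileged and therefore only available to kernel-mode code like an HRT.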
Circuit simulator benchmark
[Figure: run time in seconds (10 to 110) against Legion processors/threads (2 to 62) for Nautilus and Linux.]

Figure 11: Run time of Legion circuit simulator versus core count. The baseline Nautilus version has higher performance at 62 cores than the Linux version.
Circuit simulator benchmark
[Figure: speedup (2 to 16) against Legion processors/threads (2 to 62) for Nautilus and Linux.]

Figure 12: Speedup of Legion (normalized to 2 Legion processors) circuit simulator running on Linux and Nautilus as a function of Legion processor (thread) count.
Circuit simulator benchmark
[Figure: speedup from 0.5% to 5% against Legion processors (2 to 62).]

Figure 13: Speedup of Legion circuit simulator comparing the baseline Nautilus version and a Nautilus version that executes Legion tasks with interrupts off.
Table of Contents
Introduction Evaluation Development effort Conclusion
Kernel development
Building Nautilus as a minimal kernel layer with support for a complex, modern, many-core x86 machine took six person-months of effort on the part of seasoned OS/VMM kernel developers.

Language      SLOC
C             22697
C++           133
x86 Assembly  428
Scripting     706

Figure 9: Source lines of code for the Nautilus kernel.
Run-time support
Porting Legion:
◮ 43000 SLOC in C++
◮ Most of the work went into understanding Legion
◮ Some code added to Nautilus
Language  SLOC
C++       133
C         636

Figure 10: Lines of code added to Nautilus to support Legion, NDPC, and NESL.
◮ Four person-months to port
NESL and NDPC (related to each other) are also being ported.
Table of Contents
Introduction Evaluation Development effort Conclusion
Conclusion
◮ A microkernel
◮ And a lightweight kernel
◮ Requires effort for porting
◮ Early stage of development