Case for Transforming Parallel Run-times Into Operating System Kernels



slide-1
SLIDE 1

Case for Transforming Parallel Run-times Into Operating System Kernels

Paper Reading Group
Kyle Hale, Peter Dinda
Presented by: Maksym Planeta
18.02.2016

slide-2
SLIDE 2

Table of Contents

Introduction
Evaluation
Development effort
Conclusion

slide-3
SLIDE 3

Table of Contents

Introduction
Evaluation
Development effort
Conclusion

slide-4
SLIDE 4

What is this project?

1. Northwestern University, Sandia Labs, Oak Ridge
2. Part of the Hobbes project
3. They also develop Palacios
slide-5
SLIDE 5

Why is it interesting for us?

◮ Proposes a microkernel

slide-6
SLIDE 6

Why is it interesting for us?

◮ Proposes a microkernel
◮ Uses hyperthreads in an HPC context

slide-7
SLIDE 7

Why is it interesting for us?

◮ Proposes a microkernel
◮ Uses hyperthreads in an HPC context
◮ Targets Xeon Phi

slide-8
SLIDE 8

Why is it interesting for us?

◮ Proposes a microkernel
◮ Uses hyperthreads in an HPC context
◮ Targets Xeon Phi
◮ It cites the L4 paper:

[40] J. Liedtke. On micro-kernel construction. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), pages 237–250, Dec. 1995.

slide-9
SLIDE 9

Idea

1. HPC app runs in user mode
2. Hardware available in kernel mode
3. When an HPC program runs in kernel mode:
   3.1 All hardware features are directly available

slide-10
SLIDE 10

A typical dialog with the kernel

slide-11
SLIDE 11

ARE PROVIDED KERNEL ABSTRACTIONS THE RIGHT ONES? NOT ALWAYS

Runtime (user mode): I’d like to pin memory to a specific PFN range, please.
General OS (kernel mode): NO!

slide-12
SLIDE 12

ARE PROVIDED KERNEL ABSTRACTIONS THE RIGHT ONES? NOT ALWAYS

Runtime (user mode): I’d like to never be interrupted, please.
General OS (kernel mode): NOPE

slide-13
SLIDE 13

RESTRICTED ACCESS TO HARDWARE

Runtime (user mode): I’d like to set up some custom page mappings, please.
General OS (kernel mode): Uh, no

slide-14
SLIDE 14

RESTRICTED ACCESS TO HARDWARE

Runtime (user mode): I’d like to interrupt another processor, please.
General OS (kernel mode): HA!

slide-15
SLIDE 15

Motivation

1. HPC app runs in user mode
2. Hardware available in kernel mode
3. When an HPC program runs in kernel mode:
   3.1 All hardware features are directly available
   3.2 The kernel does not restrict the program with bad abstractions. For example, under a general kernel the run-time might need subset barriers and be forced to build them out of mutexes.

slide-16
SLIDE 16

Motivation

1. HPC app runs in user mode
2. Hardware available in kernel mode
3. When an HPC program runs in kernel mode:
   3.1 All hardware features are directly available
   3.2 The kernel does not restrict the program with bad abstractions. For example, under a general kernel the run-time might need subset barriers and be forced to build them out of mutexes.
   3.3 No resources are wasted on features the application doesn’t need. For example, under a general kernel the run-time might not require coherence, but gets it anyway.
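To make the "subset barriers out of mutexes" point concrete, here is a minimal sketch of what a user-mode run-time might end up writing when the kernel offers only mutex-style primitives. It is an illustration only, not code from Legion or Nautilus: the class `MutexSubsetBarrier` and the helper `run_demo` are hypothetical names, and the sketch assumes a mutex plus a condition variable is all that is available.

```python
import threading


class MutexSubsetBarrier:
    """Barrier for a fixed-size subset of threads, built only from a
    mutex and a condition variable (hypothetical illustration)."""

    def __init__(self, parties):
        self._parties = parties      # number of threads in the subset
        self._waiting = 0            # threads that have arrived so far
        self._generation = 0         # bumped each time the barrier opens
        self._cond = threading.Condition(threading.Lock())

    def wait(self):
        """Block until all parties have arrived; True for the last arriver."""
        with self._cond:
            generation = self._generation
            self._waiting += 1
            if self._waiting == self._parties:
                # Last thread in: reset the count and release everyone.
                self._waiting = 0
                self._generation += 1
                self._cond.notify_all()
                return True
            # Re-check the generation so spurious wake-ups are harmless.
            while generation == self._generation:
                self._cond.wait()
            return False


def run_demo(subset_size=3, outsiders=1):
    """Send subset_size threads through the barrier; outsiders skip it."""
    barrier = MutexSubsetBarrier(subset_size)
    log, leaders = [], []
    log_lock = threading.Lock()

    def member(i):
        with log_lock:
            log.append(("arrive", i))
        is_leader = barrier.wait()
        with log_lock:
            log.append(("leave", i))
            leaders.append(is_leader)

    def outsider(i):
        # Not part of the subset: never touches the barrier.
        with log_lock:
            log.append(("outsider", i))

    threads = [threading.Thread(target=member, args=(i,))
               for i in range(subset_size)]
    threads += [threading.Thread(target=outsider, args=(i,))
                for i in range(outsiders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log, leaders


if __name__ == "__main__":
    events, leader_flags = run_demo()
    print(events, leader_flags)
```

Every arrival and release goes through the same lock, which is the kind of overhead the slide argues a run-time in kernel mode could avoid by picking a primitive that fits.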

slide-17
SLIDE 17

Contributions

1. Criticize traditional architecture
2. Propose a new OS structure
3. Port some of the existing runtimes
slide-18
SLIDE 18

Hybrid Run-Time (HRT)

[Figure: layer diagrams of the two models, each labelled with User Mode and Kernel Mode.]
(a) Current Model: Parallel App, Parallel Run-time, General Kernel, Node HW.
(b) Hybrid Run-time Model: Parallel App, Hybrid Run-time (HRT), Node HW; marked as the Performance Path.

slide-19
SLIDE 19

What is HRT?

◮ The runtime is the kernel, built within a kernel framework

slide-20
SLIDE 20

What is HRT?

◮ The runtime is the kernel, built within a kernel framework
◮ Everything is kernel space

slide-21
SLIDE 21

What is HRT?

◮ The runtime is the kernel, built within a kernel framework
◮ Everything is kernel space
◮ HRT has full access to the hardware

slide-22
SLIDE 22

What is HRT?

◮ The runtime is the kernel, built within a kernel framework
◮ Everything is kernel space
◮ HRT has full access to the hardware
◮ HRT can pick its own abstractions

slide-23
SLIDE 23

Aerokernel

Figure 2: Structure of Nautilus.
[Diagram labels: Parallel Application, Runtime, HRT, Nautilus Kernel (Paging, Threads, Bootstrap, Timers, IRQs, Console, Topology, Sync., Events), Hardware, User Mode, Kernel Mode.]

slide-24
SLIDE 24

Benefits

◮ Better abstractions
◮ Noiseless
◮ Lightweight

slide-25
SLIDE 25

Legacy support

(c) Hybrid Run-time Model Within a Hybrid Virtual Machine
[Diagram labels: Parallel App, Parallel Run-time, General Kernel, General Virtualization Model, Legacy Path; Parallel App, Hybrid Run-time (HRT), Specialized Virtualization Model, Performance Path; Hybrid Virtual Machine (HVM), Node HW, User Mode, Kernel Mode.]

slide-26
SLIDE 26

Table of Contents

Introduction
Evaluation
Development effort
Conclusion

slide-27
SLIDE 27

Thread creation

Figure: Average, minimum, and maximum time to create a number of threads in sequence.
[Three plots, (a) to (c): cycles (up to 7x10^6) versus number of threads (2 to 64), Nautilus versus Linux.]

slide-28
SLIDE 28

Thread creation

Figure: Speedup (Linux / Nautilus), from the previous figure.
[Three plots, (d) to (f): speedup versus number of threads (2 to 64).]

slide-29
SLIDE 29

Thread creation

Figure: Speedup (Linux / Nautilus), from the previous figure.
[Three plots, (d) to (f): speedup versus number of threads (2 to 64).]

Why do the curves bend? In (d) at 8 threads, in (e) at 32, and in (f) at 8. Bugs?

slide-30
SLIDE 30

Thread creation (summary)

OS        Avg     Min     Max
Nautilus  16795   2907    44264
Linux     38456   34447   238866

Figure 3: Time to create a single thread measured in cycles.
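As a quick reading aid, the Linux-to-Nautilus ratios implied by Figure 3 can be worked out directly from the cycle counts. The ratios are derived here and are not reported on the slide; the names `CYCLES` and `linux_over_nautilus` are made up for this sketch.

```python
# Cycle counts copied from Figure 3 (time to create a single thread).
CYCLES = {
    "Nautilus": {"Avg": 16795, "Min": 2907, "Max": 44264},
    "Linux": {"Avg": 38456, "Min": 34447, "Max": 238866},
}


def linux_over_nautilus(cycles=CYCLES):
    """Return Linux cycles divided by Nautilus cycles for each statistic."""
    return {
        stat: cycles["Linux"][stat] / cycles["Nautilus"][stat]
        for stat in cycles["Linux"]
    }


if __name__ == "__main__":
    for stat, ratio in linux_over_nautilus().items():
        print(f"{stat}: Linux takes {ratio:.2f}x the cycles of Nautilus")
```

The gap is smallest for the average (a bit over 2x) and largest for the minimum (almost 12x).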

slide-31
SLIDE 31

Spinlock microbenchmark

OS        Execution time (s)
Nautilus  13.72
Linux     12.53

OS        Avg. acquire/release time (cycles)
Nautilus  59
Linux     36

Figure 5: Total time to acquire and release a spinlock 500 million times on Nautilus and Linux, and average time in cycles for an acquire/release pair.
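The total times in Figure 5 can be turned into wall-clock time per acquire/release pair with simple arithmetic. The sketch below only divides the reported totals by the 500 million iterations; the per-pair nanosecond values and the relative slowdown are derived here, not taken from the slide, and the names are hypothetical.

```python
ITERATIONS = 500_000_000  # acquire/release pairs, from the Figure 5 caption

# Total execution time in seconds, copied from Figure 5.
TOTAL_SECONDS = {"Nautilus": 13.72, "Linux": 12.53}


def ns_per_pair(total_seconds=TOTAL_SECONDS, iterations=ITERATIONS):
    """Wall-clock nanoseconds per acquire/release pair for each OS."""
    return {os_name: secs / iterations * 1e9
            for os_name, secs in total_seconds.items()}


def nautilus_slowdown(total_seconds=TOTAL_SECONDS):
    """Nautilus total time relative to Linux (1.0 means equal)."""
    return total_seconds["Nautilus"] / total_seconds["Linux"]


if __name__ == "__main__":
    for os_name, ns in ns_per_pair().items():
        print(f"{os_name}: {ns:.2f} ns per acquire/release pair")
    print(f"Nautilus / Linux total time: {nautilus_slowdown():.3f}")
```

In this microbenchmark Linux is the faster of the two, by roughly a tenth in total time.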

slide-32
SLIDE 32

Wake-up microbenchmark

Figure 6: Average event wakeup latency.
[Bar chart: cycles (up to 30000) for Linux, N. MWAIT, N. condvar, and N. w/kick. Annotations: "not available in userspace" and "overhead too high in userspace".]

slide-33
SLIDE 33

Circuit simulator benchmark

Figure 11: Run time of Legion circuit simulator versus core count. The baseline Nautilus version has higher performance at 62 cores than the Linux version.
[Plot: run time (s) versus Legion processors (threads), 2 to 62, Nautilus versus Linux.]

slide-34
SLIDE 34

Circuit simulator benchmark

Figure 12: Speedup of Legion (normalized to 2 Legion processors) circuit simulator running on Linux and Nautilus as a function of Legion processor (thread) count.
[Plot: speedup versus Legion processors (threads), 2 to 62, Nautilus versus Linux.]

slide-35
SLIDE 35

Circuit simulator benchmark

Figure 13: Speedup of Legion circuit simulator comparing the baseline Nautilus version and a Nautilus version that executes Legion tasks with interrupts off.
[Plot: speedup in percent (0.5 % to 5 %) versus Legion processors (threads), 2 to 62.]

slide-36
SLIDE 36

Table of Contents

Introduction
Evaluation
Development effort
Conclusion

slide-37
SLIDE 37

Kernel development

The process of building Nautilus as a minimal kernel layer with support for a complex, modern, many-core x86 machine took six person-months of effort on the part of seasoned OS/VMM kernel developers.

Language      SLOC
C             22697
C++           133
x86 Assembly  428
Scripting     706

Figure 9: Source lines of code for the Nautilus kernel.

slide-38
SLIDE 38

Run-time support

Porting Legion:

◮ 43000 SLOC in C++
◮ Most of the work went into understanding Legion
◮ Some code added to Nautilus

Language  SLOC
C++       133
C         636

Figure 10: Lines of code added to Nautilus to support Legion, NDPC, and NESL.

◮ Four person-months to port

Also porting NESL and NDPC (related to each other).

slide-39
SLIDE 39

Table of Contents

Introduction
Evaluation
Development effort
Conclusion

slide-40
SLIDE 40

Conclusion

◮ A microkernel
◮ And a lightweight kernel
◮ Requires effort for porting
◮ Early stage of development