

SLIDE 1

CARSTEN WEINHOLD, TU DRESDEN

A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING

The Hebrew University of Jerusalem

Amnon Barak Hebrew University Jerusalem (HUJI) Hermann Härtig TU Dresden, Operating Systems Group (TUDOS) Wolfgang Nagel TU Dresden, Center for Information Services and HPC (ZIH) Alexander Reinefeld Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)

SLIDE 2

The ideal world assumption:

■ Identical, predictable, and reliable nodes
■ Fast and reliable interconnect
■ Balanced applications
■ Isolated partitions of fixed size


TRADITIONAL HPC

SLIDE 3

TRADITIONAL HPC

[Diagram: work over time in traditional HPC: fixed-size chunks of work, one thread per core]

SLIDE 4

Systems software:


  • Optimize communication latency
  • Message passing uses polling
  • Batch scheduler for start / stop
  • Separate servers for I/O
  • Small OS on each node
  • No OS on critical path


TRADITIONAL HPC

SLIDE 5

LOAD

Application: CP2K

Figure: computation–communication ratio of CP2K on 512 cores (computation-time fraction per process ID and timestep)

SLIDE 6

LOAD

Application: COSMO-SPECS+FD4

Figure: computation–communication ratio of COSMO-SPECS+FD4 on 128 cores (computation-time fraction per process ID and timestep)

SLIDE 7

REALITY CHECK

Application: COSMO-SPECS+FD4

Hand-balanced compute times of ranks per time step vs. unbalanced compute times of ranks per time step

Figure: computation–communication ratio of COSMO-SPECS+FD4 on 128 cores

SLIDE 8

REALITY CHECK

Application: PRIME

Now think of:

  • Composite applications
  • In-situ visualization, etc.

Unbalanced compute times of ranks per time step

SLIDE 9

FFMK

  • German Priority Programme 1648 “Software for Exascale Computing”

FFMK: A Fast and Fault-Tolerant Microkernel-Based Operating System for Exascale Computing

SLIDE 10

CHALLENGES

  • Dynamic applications & platforms
  • Increased fault rates
  • Power / dark silicon
  • Heterogeneity (cores, memory, …)

[Diagram: FFMK-OS running on every node of the machine]

SLIDE 11

NODE ARCHITECTURE

[Diagram: node architecture. Light-weight Kernel (L4) on all cores; Service OS (Linux with drivers, monitoring, platform management, runtime support) on service cores; applications with runtime on compute cores]

SLIDE 12

3 ABSTRACTIONS

[Diagram: two address spaces with threads on the Light-weight Kernel (L4)]

SLIDE 13

MESSAGE PASSING

[Diagram: applications and file-system, network, and I/O servers on the Light-weight Kernel (L4); device interrupts delivered as messages]

SLIDE 14

BLOCK, POLL, IRET

  • Intel Core i7 3770S @ 3100 MHz
  • No Hyperthreading, no Turboboost
  • No dynamic power management
  • Same socket

Measured wake-up latency: Polling 0.1 µs, Block (L4) 3.2 µs, Block (Linux) 5.3 µs

  • 64-bit Linux 3.11.6 (cpuidle.off=0, intel_idle.max_cstate=0)

  • 64-bit L4/Fiasco.OC

Wake from interrupt on L4: 900 cycles, 0.3 µs (best case, on Intel Core i7-4770 CPU @ 3.40GHz)

SLIDE 15

LINUX ON L4

[Diagram: Linux OS and Linux apps running next to file, network, and I/O servers on the Light-weight Kernel (L4)]

SLIDE 16

HYBRID SYSTEM

  • Real-time
  • Security: small Trusted Computing Base
  • Resilience: small Reliable Computing Base

Split: uncritical and complex vs. critical and simple

SLIDE 17

HYBRID SYSTEM

[Diagram: uncritical/complex Service OS next to critical/simple file, network, and I/O servers on the Light-weight Kernel (L4)]

SLIDE 18

HYBRID SYSTEM

[Diagram: Service OS with proxy and file, network, and I/O services; MPI app with MPI library accessing InfiniBand directly; all on the Light-weight Kernel (L4)]

SLIDE 19

DRIVER REUSE

[Diagram: driver reuse. An L4 app links libibverbs and a user-space driver; the Linux kernel keeps the IB core and kernel driver behind /dev/ib0; an I/O proxy app shuttles message buffers between the two sides]

SLIDE 20

NODE ARCHITECTURE

[Diagram: node architecture. Light-weight Kernel (L4) on all cores; Service OS (Linux kernel, proxies, checkpointing) on service cores; applications with runtime, MPI library, and InfiniBand on compute cores]

SLIDE 21

FAILURE RATES

Kento Sato et al., "Design and Modeling of a Non-blocking Checkpointing System", SC'12

MTBF for component failures in an HPC system:

Failed component   | # nodes affected | MTBF
PFS, core switch   | 1408             | 65.10 days
Rack               | 32               | 86.90 days
Edge switch        | 16               | 17.37 days
PSU                | 4                | 28.94 days
Compute nodes      | 1                | 15.8 hours
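To see what these component figures mean for the machine as a whole, one can combine them; the sketch below assumes independent failures with exponentially distributed inter-arrival times (our simplification, not the slide's), so the per-class failure rates simply add:

```python
# Rough sketch: combine the table's per-component MTBFs into one system
# MTBF, assuming independent, exponentially distributed failures
# (so failure rates add). This assumption is ours, not the slide's.

HOURS_PER_DAY = 24.0

mtbf_hours = {                      # MTBF of each failure class, in hours
    "PFS, core switch": 65.10 * HOURS_PER_DAY,
    "Rack":             86.90 * HOURS_PER_DAY,
    "Edge switch":      17.37 * HOURS_PER_DAY,
    "PSU":              28.94 * HOURS_PER_DAY,
    "Compute nodes":    15.8,
}

total_rate = sum(1.0 / m for m in mtbf_hours.values())  # failures per hour
system_mtbf = 1.0 / total_rate                          # hours between failures
print(f"system MTBF ~ {system_mtbf:.1f} hours")
```

Compute-node failures dominate: under this assumption the combined system MTBF comes out at roughly 14.7 hours, only slightly below the 15.8-hour compute-node figure.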

SLIDE 22

  • In-memory XtreemFS volume for application-level checkpointing
  • RAID-5 erasure coding: recovery with 1 failed OSD
  • Demonstrator running the BQCD code on a Cray XC30


XTREEMFS

[Diagram: XtreemFS services (DIR, MRC, and several OSDs) serving the application (BQCD)]
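The recovery-with-one-failed-OSD idea is classic RAID-5: parity is the XOR of the data stripes, so any single missing stripe can be rebuilt. A toy sketch of the principle, not XtreemFS's actual erasure-coding implementation:

```python
# Toy illustration of RAID-5-style recovery: parity is the XOR of all
# data stripes, so one missing stripe (one failed OSD) can be rebuilt
# from the remaining stripes plus parity.

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# checkpoint data striped across three OSDs, parity stored on a fourth
stripes = [b"ckpt", b"-osd", b"data"]
parity = xor_blocks(stripes)

# OSD 1 fails; rebuild its stripe from the survivors and the parity
survivors = [stripes[0], stripes[2]]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt)  # b'-osd'
```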

SLIDE 23

BANDWIDTH

Figure: aggregate checkpoint bandwidth (TB/s, log scale) over node counts from 1K to 96K on the Sequoia cluster (IBM Blue Gene/Q), comparing memcpy 64ppn, CRUISE at 16/32/64 ppn, and ramdisk 16ppn

Raghunath Rajachandrasekar et al., "A 1 PB/s File System to Checkpoint Three Million MPI Tasks", HPDC'13
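Checkpoint bandwidth matters because it sets the cost δ of writing one checkpoint, which in turn sets how often to checkpoint. A standard way to pick the interval is Young's first-order approximation τ ≈ √(2·δ·MTBF); the δ and MTBF values below are illustrative assumptions, not measurements from this slide:

```python
# Young's first-order approximation of the optimal checkpoint interval:
# tau ~= sqrt(2 * delta * MTBF), where delta is the time to write one
# checkpoint. The example numbers are hypothetical.
import math

def young_interval(delta_s: float, mtbf_s: float) -> float:
    return math.sqrt(2.0 * delta_s * mtbf_s)

delta = 10.0            # s, fast in-memory checkpoint (assumed)
mtbf = 15.8 * 3600.0    # s, the compute-node MTBF from the earlier table
tau = young_interval(delta, mtbf)
print(f"checkpoint every ~{tau:.0f} s")  # ~1067 s, about 18 minutes
```

Faster checkpoints (larger aggregate bandwidth) shrink δ and therefore allow shorter intervals and less lost work per failure.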

SLIDE 24

NODE ARCHITECTURE

[Diagram: node architecture. Light-weight Kernel (L4) on all cores; Service OS (Linux kernel, proxies, platform management, checkpointing) on service cores; applications with runtime, MPI library, and InfiniBand on compute cores]

SLIDE 25

OVERDECOMPOSITION

[Diagram: work chunks on processing units over time, meeting at a barrier]

SLIDE 26

OVERDECOMPOSITION

[Diagram: work chunks on processing units over time, meeting at a barrier]

SLIDE 27

OVERDECOMPOSITION

[Diagram: work chunks on processing units over time, meeting at a barrier]

SLIDE 28

OVERDECOMPOSITION

[Diagram: work chunks on processing units over time, meeting at a barrier]

SLIDE 29

OVERDECOMPOSITION

Figure: runtimes (0–1000 s) of baseline and 2x/4x/8x/16x overdecomposition for COSMO-SPECS+FD4 (unbalanced, HT). Oversubscribed runs: 32–256 MPI ranks on 4 quad-core nodes (with polling)
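The effect behind these measurements can be reproduced in miniature: splitting imbalanced per-rank work into more, smaller chunks and packing the chunks greedily onto cores drives the makespan toward the ideal (average) load. The workload below is synthetic, not the COSMO-SPECS+FD4 data:

```python
# Hedged sketch of why overdecomposition helps: with factor k, each
# rank's work is split into k chunks, and a greedy scheduler assigns
# each chunk to the least-loaded core. Workload is synthetic.
import heapq
import random

def makespan(chunk_times, n_cores):
    """Greedy (LPT) list scheduling: longest chunks first, each to the
    currently least-loaded core; returns the resulting makespan."""
    loads = [0.0] * n_cores
    heapq.heapify(loads)
    for t in sorted(chunk_times, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + t)
    return max(loads)

random.seed(42)
n_cores = 4
base = [random.uniform(1.0, 10.0) for _ in range(n_cores)]  # imbalanced ranks

results = {}
for k in (1, 2, 4, 8, 16):
    chunks = [t / k for t in base for _ in range(k)]  # overdecompose by k
    results[k] = makespan(chunks, n_cores)

print({k: round(v, 2) for k, v in results.items()})
```

With k = 1 the makespan is the slowest rank; as k grows it approaches the average per-core load, which is exactly the smoothing the oversubscribed runs above exploit.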

SLIDE 30

BUSY WAITING

Busy waiting = computation: to the system, a polling rank is indistinguishable from one doing useful work, so it occupies its core completely.

SLIDE 31

OVERDECOMPOSITION

Figure: runtimes (0–1000 s) of baseline and 2x/4x/8x/16x overdecomposition for COSMO-SPECS+FD4 (unbalanced, HT), with the polling (busy waiting) share highlighted. Oversubscribed runs: 32–256 MPI ranks on 4 quad-core nodes (with polling)

SLIDE 32

OVERDECOMPOSITION

Figure: runtimes (0–300 s) of baseline and 2x/4x/8x/16x overdecomposition for COSMO-SPECS+FD4 (balanced and unbalanced, no HT). Oversubscribed runs: 16–256 MPI ranks on 4 quad-core nodes (without polling)

SLIDE 33

OVERDECOMPOSITION

Figure: runtimes (0–350 s) of baseline and 2x/4x/8x/16x overdecomposition for CP2K (unbalanced, no HT). Oversubscribed runs: 16–256 MPI ranks on 4 quad-core nodes (without polling)

SLIDE 34

OVERDECOMPOSITION

[Diagram: work chunks on processing units over time, meeting at a barrier]

With MPI:

  • Do not: busy wait (except very briefly)
  • Do: block in the kernel
  • Needs: fast unblocking of threads when a message comes in
  • We build: a shortcut from the IB driver into MPI threads (no Linux involved!)
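These rules together describe a spin-then-block wait. Below is a minimal sketch with Python threads standing in for an MPI rank waiting on an incoming message (the real system blocks in L4 and is woken directly from the IB driver, not via Python or Linux):

```python
# Hedged sketch of the "poll briefly, then block" rule: spin for a short
# bounded time in case the message is about to arrive, then block so the
# core is freed for other (oversubscribed) ranks.
import threading
import time

def wait_for_message(arrived: threading.Event,
                     spin_ns: int = 50_000, timeout: float = 5.0) -> bool:
    deadline = time.monotonic_ns() + spin_ns
    while time.monotonic_ns() < deadline:   # short busy-wait phase
        if arrived.is_set():
            return True
    return arrived.wait(timeout)            # then block until woken

msg = threading.Event()
sender = threading.Timer(0.01, msg.set)     # "message" arrives after 10 ms
sender.start()
got = wait_for_message(msg)
sender.join()
print(got)  # True
```

The spin window bounds the cycles wasted on polling, while the blocking path is where fast kernel-level wake-up (the IB-driver shortcut above) pays off.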
SLIDE 35

CENTRALIZED

[Diagram: centralized collection of status information from all nodes]

SLIDE 36

DECENTRALIZED

[Diagram: decentralized exchange of status information among Node 1, Node 2, …, Node n]

SLIDE 37

DECENTRALIZED

[Diagram: decentralized exchange of status information among Node 1, Node 2, …, Node n]

SLIDE 38

DECENTRALIZED

[Diagram: decentralized exchange of status information among Node 1, Node 2, …, Node n]

SLIDE 39

GOSSIP SCALABILITY

Figure: runtimes of MPI-FFT on BG/Q "JUQUEEN" (1024 nodes) and COSMO-SPECS+FD4 on BG/Q "JUQUEEN" (1024 and 2048 nodes), without gossip and with gossip intervals from 1024 ms down to 1 ms

Low overhead: no noticeable overhead at gossip intervals of 64–256 ms

Quality of information: average age at nodes on the order of 2–3 s with a gossip interval of 256 ms

E. Levy, A. Barak, A. Shiloh, M. Lieber, C. Weinhold, and H. Härtig, "Overhead of a Decentralized Gossip Algorithm on the Performance of HPC Applications", ROSS 2014
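The mechanism being measured can be sketched as a push-gossip simulation: every node keeps an age (staleness) per peer, refreshes its own entry each interval, and forwards a window of its freshest entries to one random peer. The parameters below are illustrative, not the JUQUEEN configuration:

```python
# Hedged sketch of push gossip with a bounded window: ages are measured
# in gossip intervals; each round every node forwards its `window`
# freshest entries (aged by one hop) to one random peer, which keeps
# the fresher copy. Parameters are illustrative only.
import random

def gossip_round(ages, window, rng):
    n = len(ages)
    for row in ages:                    # all information ages by one interval
        for j in range(n):
            row[j] += 1
    for i in range(n):
        ages[i][i] = 0                  # each node refreshes its own entry
    updates = []
    for i in range(n):                  # pick peer and freshest entries
        peer = rng.randrange(n)
        newest = sorted(range(n), key=lambda j: ages[i][j])[:window]
        updates.append((peer, [(j, ages[i][j] + 1) for j in newest]))
    for peer, entries in updates:       # receiver keeps the fresher copy
        for j, age in entries:
            if age < ages[peer][j]:
                ages[peer][j] = age

rng = random.Random(1)
n, window = 64, 8
UNKNOWN = 10**6                         # "never heard of this node yet"
ages = [[0 if i == j else UNKNOWN for j in range(n)] for i in range(n)]
for _ in range(200):
    gossip_round(ages, window, rng)
avg_age = sum(sum(row) for row in ages) / (n * n)
print(f"average age: {avg_age:.1f} intervals")
```

With a window much smaller than the node count, the average age settles at a small multiple of the gossip interval, which is the shape of the 2–3 s result reported above for a 256 ms interval.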

SLIDE 40

GOSSIP VS FAULTS

Table: simulated average master's age with failed nodes (1024 nodes per colony)

Failed nodes per colony | Circulating local windows of size
                        | 16    | 32    | 64   | 128  | 256
0                       | 11.74 | 9.67  | 8.66 | 8.20 | 8.07
1                       | 11.71 | 9.72  | 8.67 | 8.21 | 8.07
2                       | 11.75 | 9.68  | 8.70 | 8.21 | 8.08
4                       | 11.81 | 9.73  | 8.70 | 8.23 | 8.11
8                       | 11.83 | 9.79  | 8.72 | 8.28 | 8.17
16                      | 11.95 | 9.90  | 8.79 | 8.34 | 8.20
32                      | 12.12 | 10.05 | 8.96 | 8.48 | 8.36
Standard deviation      | 0.49  | 0.42  | 0.37 | 0.36 | 0.36
Increase rate           | 3.2%  | 3.9%  | 3.5% | 3.4% | 3.6%

Gossip is fault tolerant: Only slight increase in average age when substantial number of nodes fail (up to 32 of 1024 in each colony)

A. Barak, Z. Drezner, E. Levy, M. Lieber, and A. Shiloh, "Resilient gossip algorithms for collecting online management information in exascale clusters", Concurrency and Computation: Practice and Experience, 2015

SLIDE 41

DECENTRALIZED

[Diagram: decentralized exchange of status information among Node 1, Node 2, …, Node n]

SLIDE 42

When? MOSIX: load difference discovered
Where? MOSIX: memory, cycles, communication
Which? MOSIX: past predicts future



DECENTRALIZED

[Diagram: decentralized exchange of status information among Node 1, Node 2, …, Node n]

When? + anomaly anticipated
Where? + topology
Which? + application knowledge

SLIDE 43

NODE ARCHITECTURE

[Diagram: node architecture. Light-weight Kernel (L4) on all cores; Service OS (Linux kernel, proxies, platform management with decision making and gossip, checkpointing) on service cores; applications with runtime, MPI library, InfiniBand, and monitors on compute cores]

SLIDE 44

MANAGEMENT

Decision Making: hardware monitoring, migration, platform info (gossip)

Resource Prediction: application hints, past behavior, future behavior?
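A minimal stand-in for the "past behavior" part of resource prediction is an exponentially weighted moving average over recent timesteps; the predictor FFMK actually uses is not specified on this slide:

```python
# Minimal "past predicts future" sketch: predict the next per-timestep
# compute time of a rank as an exponentially weighted moving average
# (EWMA) of its history. alpha weights recent timesteps more heavily.
def ewma_predict(history, alpha=0.5):
    estimate = history[0]
    for value in history[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate

# a rank whose compute time per timestep is creeping upward
print(ewma_predict([10.0, 10.0, 12.0, 14.0]))  # 12.5
```

Such a predictor is cheap enough to run per rank and per timestep, and its output is what the decision-making layer would weigh against migration cost.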

SLIDE 45

APPLICATION HINTS

Resource prediction:

  • CPU cycles
  • Cache misses
  • Memory
  • Energy?

SLIDE 46

THERMAL PLANNING

[Diagram: work chunks on processing units over time, meeting at a barrier]

SLIDE 47

THERMAL PLANNING

[Diagram: work chunks on processing units over time, meeting at a barrier]

SLIDE 48

THERMAL PLANNING

[Diagram: work chunks on processing units over time, meeting at a barrier]

SLIDE 49

APPLICATION HINTS

Resource prediction:

  • CPU cycles
  • Cache misses
  • Memory
  • Energy?

Fault tolerance:

  • Which data to checkpoint (and when)
  • Ability to handle node failures

SLIDE 50

NODE ARCHITECTURE

[Diagram: node architecture. Light-weight Kernel (L4) on all cores; Service OS (Linux kernel, proxies, platform management with decision making and gossip, checkpointing) on service cores; applications with runtime, MPI library, InfiniBand, and monitors on compute cores]

SLIDE 51

SUMMARY

  • FFMK prototype runs on a real HPC system
  • We are building an OS/R for exascale: load management and fault tolerance
  • Goal: take this burden off application developers

ffmk.tudos.org