slide-1
SLIDE 1

Preliminary Investigations into a Microkernel OSAL for cFS

Gregor Peach, Joseph Espy, Zach Day, Gabriel Parmer, Alex Maloney, Gerald Fry*, Curt Wu*

The George Washington University
* Charles River Analytics

Acknowledgements: This material is based upon work supported by the National Science Foundation under Grant No. CNS 1149675, ONR Award No. N00014-14-1-0386, and ONR STTR N00014-15-P-1182 and N68335-17-C-0153. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or ONR.

slide-2
SLIDE 2

Traditional Satellites

Fault tolerance

  • Hardware redundancy
  • Rad-hardened processors
  • Single-core processors
slide-3
SLIDE 3

CubeSats

Commodity hardware

  • High clock speed
  • Multi-core
  • Limited hardware reliability features

slide-4
SLIDE 4

CubeSats

Commodity hardware

  • High clock speed
  • Multi-core
  • Limited hardware reliability features

Spare capacity + no HW reliability → SW reliability

slide-5
SLIDE 5

CubeSats

Commodity hardware

  • High clock speed
  • Multi-core
  • Limited hardware reliability features

How to most effectively use the parallelism?

slide-6
SLIDE 6

How can we use extra computational capacity to increase fault tolerance?

slide-7
SLIDE 7

Aspects of SW Fault Tolerance

  • Remediation: How do we return system to a well-defined state?
  • Propagation: How do we contain the scope of the fault?
  • Detection: Determine when system is in an erroneous state

slide-9
SLIDE 9

Core Flight System

[Figure: cFS layered architecture. Layers: Mission-specific applications; General utility applications; Core Flight Executive Functions; Operating System Abstraction Layer. Box labels: SW Bus, Tables, Mutex, HK, CS, S, Sched, FS, Loader, Net, PSP.]

slide-10
SLIDE 10

Core Flight System – Faults

[Figure: cFS layered architecture diagram, as on the Core Flight System slide.]

slide-13
SLIDE 13

Core Flight System – Faults + POSIX

[Figure: cFS layered architecture diagram, as on the Core Flight System slide.]

slide-14
SLIDE 14
slide-15
SLIDE 15

[Figure: cFS layered architecture diagram without the Sched, FS, Loader, Net, and PSP boxes.]

slide-17
SLIDE 17

Composite μ-kernel

Small kernel (~7K LoC), real-time focus

  • Focused on IPC between protection domains

Export policies to user-level components

  • Scheduling, dev. drivers, memory mgmt, FS, ...

[Figure: Composite structure. User-level: NIC, Scheduler. Kernel: interrupt vectoring, memory mapping, sync/async IPC.]

slide-18
SLIDE 18

Composite OSAL/PSP

[Figure: cFS layered architecture diagram, as on the Core Flight System slide.]

slide-19
SLIDE 19

Composite OSAL/PSP

  • Communication explicitly controlled by design
  • IPC and scheduling are fast:

                 Composite     Linux
2-way IPC        700 cycles    600 (syscall), 3500 (pipes)
Thd Dispatch     300 cycles    1800 (yield)

[Figure: component diagram; labels: SW Bus, Mutex, Tables, HK, S, CS, Sched, Load, FS, Driver, Net, NIC.]

slide-20
SLIDE 20

Composite OSAL/PSP – Current

  • Fixed priority preemptive scheduling
  • RAM-based FS
  • Application loader:
    – Into shared protection domain
    – Into separate protection domains

[Figure: component diagram; labels: SW Bus, Mutex, Tables; HK; S; CS; Sched/load/FS/net.]
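The fixed priority preemptive scheduling policy listed above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Composite scheduler component; the `Task` type, the function names, and the convention that a lower number means a higher priority are all assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    priority: int       # assumption: lower number = higher priority
    ready: bool = True


def pick_next(tasks: List[Task]) -> Optional[Task]:
    """Return the highest-priority ready task, or None if nothing is ready."""
    ready = [t for t in tasks if t.ready]
    return min(ready, key=lambda t: t.priority) if ready else None


def should_preempt(running: Optional[Task], woken: Task) -> bool:
    """A newly ready task preempts only if it outranks the running task."""
    return running is None or woken.priority < running.priority
```

Under this policy the scheduler would call `should_preempt` whenever a task becomes ready and `pick_next` whenever the running task blocks or exits.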

slide-21
SLIDE 21

Composite OSAL/PSP – Current

  • Lines of C Code: < 4000 LoC
  • OSAL unit tests: > 89% successful

  • score/osfile/osfilesys/osloader, 15% not relevant (OS call failure)
  • In progress:
    – serialization/deserialization of OSAL arguments
    – increasing application support

[Figure: component diagram; labels: SW Bus, Mutex, Tables; HK; S; CS; Sched/load/FS/net.]

slide-22
SLIDE 22

Aspects of SW Fault Tolerance

  • Remediation: How do we return system to a well-defined state?
  • Propagation: How do we contain the scope of the fault?
  • Detection: Determine when system is in an erroneous state

slide-23
SLIDE 23

Watchdog Timer

  • Applications
    – periodically declare successful execution
  • Every watchdog timer period (1-10 seconds):
    – Have all applications and system components checked in?
    – No: reboot!

[Figure: Detection/Remediation diagram; entries: Watchdog Timer, Reboot.]
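The check-in protocol above can be sketched as follows. This is an illustrative Python model, not cFS or Composite code; the `Watchdog` class and its method names are hypothetical, and the reboot itself is left to the caller.

```python
from typing import Iterable, List, Set


class Watchdog:
    def __init__(self, expected: Iterable[str]) -> None:
        # Names of every application and system component that must check in.
        self.expected: Set[str] = set(expected)
        self.checked_in: Set[str] = set()

    def check_in(self, name: str) -> None:
        """Called by an application to declare successful execution."""
        if name in self.expected:
            self.checked_in.add(name)

    def expire(self) -> List[str]:
        """Called once per watchdog period.

        Returns the names that failed to check in during the period
        (empty list means all is well) and resets for the next period.
        """
        missing = sorted(self.expected - self.checked_in)
        self.checked_in.clear()
        return missing
```

A timer handler would call `expire()` every period and trigger a reboot whenever the returned list is non-empty.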

slide-26
SLIDE 26

Redundant Execution

[Figure: Detection/Remediation diagram; entries: Double Modular Redundancy, Triple Modular Redundancy.]

[Figure: replicated cFS stacks (SW Bus, Tables, Mutex, HK, CS, S, ..., Sched, FS, Load, Net, PSP), each group feeding a Voter: two replicas for double modular redundancy, three for triple modular redundancy.]

Composite Voter (in-progress)

  • < 800 LoC in Rust
  • Utilize high-performance IPC + scheduling
  • Design: minimize memory footprint and CPU footprint
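The voting step behind double and triple modular redundancy can be sketched as below. The sketch is in Python for brevity rather than the Rust of the Composite voter, and the function name `vote` and its return shape are assumptions for illustration, not the Composite voter's interface. With two replicas a disagreement can only be detected; with three, a single faulty replica can also be outvoted.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], bool]:
    """Compare replica outputs.

    Returns (agreed_value, fault_detected). agreed_value is the value
    produced by a strict majority of replicas, or None if no such
    majority exists. fault_detected is True if any replica disagrees.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    value, count = Counter(outputs).most_common(1)[0]
    fault_detected = count < len(outputs)
    majority = value if count > len(outputs) // 2 else None
    return majority, fault_detected
```

For example, `vote([7, 8])` reports a fault without a usable value, while `vote([7, 7, 9])` reports a fault and still yields 7.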

slide-30
SLIDE 30

Checkpoint/Restore

[Figure: Detection/Remediation diagram; entry: Checkpoint/Restore.]

[Figure: timeline of cFS stack snapshots (SW Bus, Tables, Mutex, HK, CS, S, ..., Sched, FS, Load, Net, PSP) with Checkpoint and Restore points marked along the time axis.]

Composite Checkpoint/Restore

              Composite*   Linux/CRIU*   Xen+
Checkpoint    0.2 ms       800 ms        8 s
Restore       0.2 ms       500 ms        10 s

* 1 MB   + 512 MB

Increases at rate of memcpy
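The checkpoint and restore operations can be modeled with a minimal sketch that copies the whole state on checkpoint and copies it back on restore, so the cost of both grows with the amount of state copied. This is a Python illustration only; the `Checkpointer` class is hypothetical and does not reflect how Composite, CRIU, or Xen implement the mechanism.

```python
import copy
from typing import Any, Dict, Optional


class Checkpointer:
    def __init__(self, state: Dict[str, Any]) -> None:
        self.state = state                         # live, mutable state
        self._saved: Optional[Dict[str, Any]] = None

    def checkpoint(self) -> None:
        """Snapshot the full state; cost grows with the size of the state."""
        self._saved = copy.deepcopy(self.state)

    def restore(self) -> None:
        """Roll the live state back to the last checkpoint, in place."""
        if self._saved is None:
            raise RuntimeError("no checkpoint taken")
        self.state.clear()
        # Copy again so later mutations cannot corrupt the saved snapshot.
        self.state.update(copy.deepcopy(self._saved))
```

Restoring in place keeps every existing reference to the state dictionary valid, and the same checkpoint can be restored more than once.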

slide-31
SLIDE 31

Computational Crash Cart

Recover system-level components upon failure

  • Record summary of component comms
  • Reboot component + re-establish state

Focus on real-time

  • Tens of microseconds recovery time

Complementary to application-level reliability

  • Checkpoint/Redundant execution

[Figure: Detection/Remediation diagram; entry: Computational Crash Cart.]
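The idea of recording a summary of component communications and replaying it after a reboot can be illustrated with a simplified sketch. It is a Python model under stated assumptions, not the mechanism from the cited work: `LockService` is a hypothetical stand-in for a system-level component, and `RecoverableLockService` is a hypothetical wrapper that interposes on its interface.

```python
from typing import Dict


class LockService:
    """Stand-in system component whose internal state is lost on a crash."""

    def __init__(self) -> None:
        self.owners: Dict[str, str] = {}

    def take(self, lock: str, owner: str) -> None:
        self.owners[lock] = owner

    def release(self, lock: str) -> None:
        self.owners.pop(lock, None)


class RecoverableLockService:
    """Interposes on the interface and keeps a summary of live state."""

    def __init__(self) -> None:
        self.component = LockService()
        # Summary, not a full call log: only locks that are currently held.
        self.summary: Dict[str, str] = {}

    def take(self, lock: str, owner: str) -> None:
        self.component.take(lock, owner)
        self.summary[lock] = owner

    def release(self, lock: str) -> None:
        self.component.release(lock)
        self.summary.pop(lock, None)

    def recover(self) -> None:
        """Reboot the component, then replay the summary to rebuild state."""
        self.component = LockService()
        for lock, owner in self.summary.items():
            self.component.take(lock, owner)
```

Because the summary holds only what is still live, the replay work is bounded by current state rather than by the component's history.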

slide-32
SLIDE 32

Monitoring for Detection

Monitor/log system interactions and timing

  • API calls, context switches, interrupts, …

Process log

  • Interactions deviate from system model?
  • Interactions statistically deviate from historically correct behaviors?

[Figure: Detection/Remediation diagram; labels: Monitoring + ML, Composite Monitoring.]
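One simple way to ask whether a logged interaction statistically deviates from historically correct behavior is a z-score test on a timing measurement. The sketch below is an assumption for illustration, not the model-based or ML-based monitors referenced in this deck; the function name `deviates` and the threshold of three standard deviations are hypothetical choices.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates(history: Sequence[float], sample: float,
             threshold: float = 3.0) -> bool:
    """True if sample lies more than `threshold` standard deviations
    from the mean of the historical measurements."""
    if len(history) < 2:
        return False          # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return sample != mu   # history is constant: any change deviates
    return abs(sample - mu) / sigma > threshold
```

Here `history` would hold, for example, past latencies of one API call taken from the log, and `sample` the latest observation.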

slide-33
SLIDE 33

How can we effectively use the parallelism of commodity CPUs?

slide-35
SLIDE 35

Composite + Parallelism

Kernel designed to be lock-less

  • Kernel operations are all wait-free → real-time
  • IPC core-local, or inter-core

[Figure: component diagram; labels: HK, S, CS, FS, Driver, Net, NIC, SW Bus, Mutex, Tables.]

Composite parallelism orchestration

Example: OpenMP fork/join parallelism

  • 2-40x decrease in worst-case inter-core communication latencies
  • Up to 40 cores, across 4 sockets
slide-36
SLIDE 36

CubeSats: Fresh View on cFE

How can we effectively

  • utilize new resource availability, and
  • provide SW fault tolerance?

→ Composite OSAL enables new options

Where do we go from here?

  • Need your feedback...
slide-37
SLIDE 37

? || /* */

slide-38
SLIDE 38
  • Computational Crash Cart, Checkpointing:

http://www2.seas.gwu.edu/~gparmer/publications/rtss13_c3.pdf

  • Model-based event monitoring:

http://www2.seas.gwu.edu/~gparmer/publications/rtas15cmon_extended.pdf

  • ML-based event monitoring:

http://www2.seas.gwu.edu/~gparmer/publications/certs16caml.pdf

  • Micro-benchmarks and virtualization:

http://www2.seas.gwu.edu/~gparmer/publications/rtss17tcaps.pdf

  • Lock-free, predictable kernel:

http://www2.seas.gwu.edu/~gparmer/publications/rtas15speck.pdf

  • OpenMP Fork/Join parallelism:

http://www2.seas.gwu.edu/~gparmer/publications/rtas14_fjos.pdf