OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING


SLIDE 1

OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING

Björn Döbel (TU Dresden)
Hermann Härtig (TU Dresden)
Michael Engel (TU Dortmund)

Tampere, 08.10.2012

SLIDE 2

Fault Tolerance: State of the Union

Axes: non-COTS vs. COTS; Hardware errors vs. Software errors

Operating System Support for Redundant Multithreading slide 1 of 13

SLIDE 3

Fault Tolerance: State of the Union

Axes: non-COTS vs. COTS; Hardware errors vs. Software errors

RAD-hard CPUs, Redundant Multithreading

SLIDE 4

Fault Tolerance: State of the Union

Axes: non-COTS vs. COTS; Hardware errors vs. Software errors

RAD-hard CPUs, Redundant Multithreading, HP NonStop, IBM z/OS

SLIDE 5

Fault Tolerance: State of the Union

Axes: non-COTS vs. COTS; Hardware errors vs. Software errors

RAD-hard CPUs, Redundant Multithreading, HP NonStop, IBM z/OS, seL4, Minix3, Carburizer

SLIDE 6

Fault Tolerance: State of the Union

Axes: non-COTS vs. COTS; Hardware errors vs. Software errors

RAD-hard CPUs, Redundant Multithreading, HP NonStop, IBM z/OS, seL4, Minix3, Carburizer, SWIFT, Encoded Processing

SLIDE 7

Fault Tolerance: State of the Union

Axes: non-COTS vs. COTS; Hardware errors vs. Software errors

RAD-hard CPUs, Redundant Multithreading, HP NonStop, IBM z/OS, seL4, Minix3, Carburizer, SWIFT, Encoded Processing, Romain

SLIDE 8

Process-Level Redundancy [Shye 2007]

Binary recompilation

  • Complex, unprotected compiler
  • Architecture-dependent

System calls for replica synchronization
Virtual memory fault isolation

  • Restricted to Linux user-level programs

SLIDE 9

Process-Level Redundancy [Shye 2007]

Binary recompilation

  • Complex, unprotected compiler
  • Architecture-dependent

Reuse OS mechanisms
System calls for replica synchronization
Additional synchronization events
Virtual memory fault isolation

  • Restricted to Linux user-level programs

Microkernel-based

SLIDE 10

Transparent Replication as OS Service

Application
L4 Runtime Environment
L4/Fiasco.OC microkernel

SLIDE 11

Transparent Replication as OS Service

Replicated Application
L4 Runtime Environment
Romain
L4/Fiasco.OC microkernel

SLIDE 12

Transparent Replication as OS Service

Unreplicated Application
Replicated Application
L4 Runtime Environment
Romain
L4/Fiasco.OC microkernel

SLIDE 13

Transparent Replication as OS Service

Replicated Driver
Unreplicated Application
Replicated Application
L4 Runtime Environment
Romain
L4/Fiasco.OC microkernel

SLIDE 14

Transparent Replication as OS Service

Reliable Computing Base (RCB)
Replicated Driver
Unreplicated Application
Replicated Application
L4 Runtime Environment
Romain
L4/Fiasco.OC microkernel

SLIDE 15

Romain: Structure

Master

SLIDE 16

Romain: Structure

Replica, Replica, Replica
Master

SLIDE 17

Romain: Structure

Replica, Replica, Replica
Master
= (replica states compared)

SLIDE 18

Romain: Structure

Replica, Replica, Replica
Master: System Call Proxy, Resource Manager
= (replica states compared)
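A small simulation may make the master's role more concrete. This is a hypothetical Python sketch, not Romain's code: it assumes that replicas trap to the master at externalization events such as system calls, that the master compares a register snapshot from each replica (the "=" above), and that with three replicas (TMR) a majority vote repairs a single faulty replica, while with two (DMR) a mismatch can only be detected. `ReplicaState`, `vote`, and `handle_syscall` are illustrative names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ReplicaState:
    """Hypothetical register snapshot taken when a replica traps to the master."""
    eax: int
    ebx: int
    eip: int


def vote(states: Sequence[ReplicaState]) -> Optional[ReplicaState]:
    """Return the state held by a strict majority of replicas, else None."""
    if not states:
        return None
    state, count = Counter(states).most_common(1)[0]
    return state if 2 * count > len(states) else None


def handle_syscall(states: list[ReplicaState]) -> str:
    """Compare replica states before a system call's effects become visible."""
    winner = vote(states)
    if winner is None:
        # DMR case (or multiple faults): the mismatch is detected, not corrected.
        return "error detected, no majority"
    faulty = [i for i, s in enumerate(states) if s != winner]
    for i in faulty:
        states[i] = winner  # TMR case: repair the outvoted replica
    # Here a system call proxy would perform the call once for all replicas.
    return f"executed once; repaired replicas {faulty}"


good = ReplicaState(eax=4, ebx=1, eip=0x1000)
bad = ReplicaState(eax=5, ebx=1, eip=0x1000)  # e.g. a bit flip in eax
print(handle_syscall([good, bad, good]))  # executed once; repaired replicas [1]
print(handle_syscall([good, bad]))        # error detected, no majority
```

The same comparison logic serves DMR and TMR; only the number of replicas decides whether a mismatch can be corrected or merely reported.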

SLIDE 19

Resource Management: Capabilities

Replica 1: capability slots 1 2 3 4 5 6

SLIDE 20

Resource Management: Capabilities

Replica 1: capability slots 1 2 3 4 5 6
Replica 2: capability slots 1 2 3 4 5 6

SLIDE 21

Resource Management: Capabilities

Replica 1: capability slots 1 2 3 4 5 6
Replica 2: capability slots 1 2 3 4 5 6
Master: capability slots 1 2 3 4 5 6

SLIDE 22

Partitioned Capability Tables

Replica 1: capability slots 1 2 3 4 5 6
Replica 2: capability slots 1 2 3 4 5 6
Master: capability slots 1 2 3 4 5 6
Legend: marked used; master private
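The slide does not spell out the allocation policy. One plausible reading, sketched below under that assumption, is that slots holding master-private capabilities are pre-marked as used in every replica's table, so replicas never allocate them and deterministic replicas end up with identical slot numbers. `CapTable`, `alloc`, and the choice of slots 5 and 6 are hypothetical.

```python
from typing import Optional


class CapTable:
    """Hypothetical per-replica capability table with slots 1..size."""

    def __init__(self, size: int, master_private: frozenset[int]) -> None:
        self.size = size
        # Slots the master uses for its own capabilities are pre-marked as
        # used, so the replica's allocator can never hand them out.
        self.used: dict[int, str] = {slot: "<master private>" for slot in master_private}

    def alloc(self, obj: str) -> Optional[int]:
        """Place obj in the lowest free slot and return the slot number."""
        for slot in range(1, self.size + 1):
            if slot not in self.used:
                self.used[slot] = obj
                return slot
        return None  # table full


MASTER_PRIVATE = frozenset({5, 6})  # assumption: the master reserves slots 5 and 6
replicas = [CapTable(6, MASTER_PRIVATE) for _ in range(3)]
# Deterministic replicas issue the same requests in the same order and
# therefore receive the same slot numbers.
slots = [[t.alloc(name) for name in ("log", "timer", "file")] for t in replicas]
print(slots)  # [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
```

Identical slot numbers would let the master compare system-call arguments that contain capability selectors directly, without translating them per replica.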

SLIDE 23

Replica Memory Management

Replica 1: rw, ro, ro
Replica 2: rw, ro, ro
Master
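The diagram suggests that each replica sees one writable (rw) and two read-only (ro) regions. The following hypothetical sketch shows one mapping policy consistent with that picture: writable regions become private per-replica copies, read-only regions are mapped from a single shared copy. Whether Romain shares or copies read-only memory is not stated on the slide; sharing is an assumption here, and `Region`, `ReplicaAddressSpace`, and `map_into_replicas` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    writable: bool
    data: bytearray


@dataclass
class ReplicaAddressSpace:
    mappings: dict[str, bytearray] = field(default_factory=dict)


def map_into_replicas(regions: list[Region], n_replicas: int) -> list[ReplicaAddressSpace]:
    """Hypothetical policy: private copies for rw regions, one shared copy for ro."""
    spaces = [ReplicaAddressSpace() for _ in range(n_replicas)]
    for region in regions:
        for space in spaces:
            if region.writable:
                # Private copy: a faulty write in one replica cannot reach
                # another replica's memory.
                space.mappings[region.name] = bytearray(region.data)
            else:
                # Read-only data is never written, so replicas may share it.
                # (Python cannot enforce this; a real memory manager would
                # map these pages without write permission.)
                space.mappings[region.name] = region.data
    return spaces


regions = [
    Region("data", writable=True, data=bytearray(4)),
    Region("text", writable=False, data=bytearray(b"\x90" * 4)),
]
r1, r2 = map_into_replicas(regions, 2)
r1.mappings["data"][0] = 0xFF  # a (possibly faulty) write in replica 1
print(r2.mappings["data"][0])   # 0: replica 2's copy is unaffected
```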

SLIDE 26

Shared Memory

  • Not under complete control of the master
  • Standard technique: trap & emulate
    – Execution overhead (×100 to ×1000)
    – Adds complexity to the RCB: disassembler (6,000 LoC), tiny emulator (500 LoC)
  • Our implementation: copy & execute

SLIDE 27

Copy&Execute

Master
Replica

SLIDE 28

Copy&Execute

Master
Replica: mov eax, [ebx]  X (access faults)

SLIDE 29

Copy&Execute

Master
Replica: mov eax, [ebx]

SLIDE 30

Copy&Execute

Master:
  load replica state
  NOP; NOP; ...; NOP
  restore master state
Replica: mov eax, [ebx]

SLIDE 31

Copy&Execute

Master:
  load replica state
  NOP; NOP; ...; NOP
  restore master state
Replica: mov eax, [ebx]
Copied to master: mov eax, [ebx]

SLIDE 32

Copy&Execute

Master:
  load replica state
  mov eax, [ebx]   (copied into the NOP area)
  restore master state
Replica: mov eax, [ebx]
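The steps above can be modelled in a few lines. This is a toy Python model under stated assumptions, not the real mechanism, which copies x86 machine code into an executable buffer: instructions are represented as Python functions, and `copy_and_execute`, `nop`, and `mov_eax_from_ebx` are hypothetical names. The point it illustrates is that the instruction's semantics come from executing it in the master, which has the shared memory mapped, rather than from a software emulator.

```python
from typing import Callable

Regs = dict[str, int]
Memory = dict[int, int]
Instr = Callable[[Regs, Memory], None]


def nop(regs: Regs, mem: Memory) -> None:
    """Padding slot in the master's code buffer."""


def mov_eax_from_ebx(regs: Regs, mem: Memory) -> None:
    """Models `mov eax, [ebx]`: load the word at address ebx into eax."""
    regs["eax"] = mem[regs["ebx"]]


def copy_and_execute(faulting: Instr, replica_regs: Regs,
                     shared_mem: Memory, pad: int = 4) -> Regs:
    """Run a replica's faulting instruction inside the master (toy model)."""
    # Code buffer: NOP padding between "load replica state" and "restore master state".
    buffer: list[Instr] = [nop] * pad
    # Copy the faulting instruction over the padding instead of emulating it.
    buffer[0] = faulting
    cpu = dict(replica_regs)  # load replica state
    for instr in buffer:
        instr(cpu, shared_mem)  # executes with the master's view of shared memory
    # restore master state: the updated register file goes back to the replica.
    return cpu


shared = {0x2000: 42}
replica = {"eax": 0, "ebx": 0x2000}
print(copy_and_execute(mov_eax_from_ebx, replica, shared))
# {'eax': 42, 'ebx': 8192}
```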

SLIDE 35

Benchmarks

  • MiBench suite
  • Fault injection to confirm fault distribution ratios
  • Overhead for DMR and TMR
  • Microbenchmarks for shared memory

SLIDE 36

Overhead vs. Unreplicated Execution
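The plot for this slide was not preserved. For reference, the overhead metric used in this talk is defined as overhead compared to a single, unreplicated run; the hypothetical helper below computes it as a relative slowdown in percent. Under this definition, an overhead below 30% means the replicated run takes less than 1.3 times as long as the unreplicated one. The runtimes in the example are made up, not measured values.

```python
def overhead_percent(t_replicated: float, t_unreplicated: float) -> float:
    """Slowdown of a replicated run relative to a single, unreplicated run, in %."""
    if t_unreplicated <= 0:
        raise ValueError("baseline runtime must be positive")
    return (t_replicated - t_unreplicated) / t_unreplicated * 100.0


# Hypothetical runtimes in seconds, not measured values.
print(overhead_percent(2.5, 2.0))  # 25.0
```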

SLIDE 37

Romain Lines of Code

Component                             LoC
Base code (main, logging, locking)    325
Application loader                    375
Replica manager                       628
Redundancy                            153
Memory manager                        445
System call proxy                     311
Shared memory                         281
Total                               2,518

Not included in the total:
Fault injector                        668
GDB server stub                     1,304

SLIDE 38

Conclusion

  • Redundant Multithreading as an OS service
  • Support for binary-only applications
  • Overheads <30%, often <5%
  • Shared memory handling is slow
  • Work in progress:

    – Multithreading
    – Device drivers

SLIDE 39

Nothing to see here

This slide intentionally left blank. Except for above text.

SLIDE 40

Hardening the RCB

  • We need: dedicated mechanisms to protect the RCB (HW or SW)

  • We have: Full control over software
  • Use FT-encoding compiler?
    – Has not been done for kernel code yet
    – Only protects SW components

  • RAD-hardened hardware?
    – Too expensive

  • Our proposal: split HW into ResCores and NonResCores

Figure: one ResCore and ten NonResCores

SLIDE 41

Signaling Performance

  • Overhead compared to a single, unreplicated run
  • Benchmarks with the highest overhead in the EMSOFT paper
  • Test machine:
    – 12x Intel Core2 2.6 GHz
    – Replicas pinned to dedicated physical cores
    – Hyperthreading off

Figure: Overhead by notification method (y-axis: overhead in %; methods: Local Faults, Migration, Sync IPC, Shared Mem; benchmarks: susan and CRC32, each under DMR and TMR)

SLIDE 42

What about signalling failures?

Missed CPU exceptions → detected by watchdog
Spurious CPU exceptions → detected by watchdog / state comparison
Transmission of corrupt state → detected during state comparison
Overwriting remote state during transmission:

  • NonResCore memory
  • Accessible by ResCores, but not by other NonResCores
  • Prevents overwriting other states
  • Already available in HW: IBM/Cell
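The slide names a watchdog but not its design. The hypothetical sketch below assumes the watchdog is armed when the first replica reaches a synchronization point and flags every replica that has not arrived within a timeout; the function name, the time representation, and the timeout value are all illustrative. A replica that missed the CPU exception which should have stopped it keeps running and never arrives, so the silent divergence becomes a detectable timeout.

```python
from typing import Optional


def late_replicas(arrivals: dict[int, Optional[float]], timeout: float) -> list[int]:
    """Return ids of replicas that missed the deadline for a synchronization point.

    arrivals maps replica id -> time it trapped to the master (None = never).
    The watchdog is assumed to start when the first replica arrives.
    """
    seen = [t for t in arrivals.values() if t is not None]
    if not seen:
        return []  # no replica has arrived yet, so the watchdog is not armed
    deadline = min(seen) + timeout
    return sorted(rid for rid, t in arrivals.items() if t is None or t > deadline)


# Replica 2 missed the CPU exception that should have stopped it.
print(late_replicas({0: 1.00, 1: 1.02, 2: None}, timeout=0.5))  # [2]
```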

SLIDE 43

Romain

http://www.dynamo-dresden.de