ASTEROID AN ANALYZABLE, RESILIENT, EMBEDDED REAL-TIME OPERATING - - PowerPoint PPT Presentation


ASTEROID – AN ANALYZABLE, RESILIENT, EMBEDDED REAL-TIME OPERATING SYSTEM DESIGN. Björn Döbel, Hermann Härtig (TU Dresden); Philip Axer, Rolf Ernst (TU Braunschweig). Böblingen, 07.02.2013.


slide-1
SLIDE 1

ASTEROID – AN ANALYZABLE, RESILIENT, EMBEDDED REAL-TIME OPERATING SYSTEM DESIGN

Björn Döbel, Hermann Härtig (TU Dresden)
Philip Axer, Rolf Ernst (TU Braunschweig)

Böblingen, 07.02.2013

slide-2
SLIDE 2

The Many Faces of Hardware Faults

  • Radiation-induced soft errors

– Mainly an issue in avionics and space [1]

  • DRAM errors in large data centers

– Google study: > 2% of DRAM DIMMs fail per year [2]
– ECC will not even detect a significant share of these errors [3]
– Disk failure rate of about 5% [4]

  • Furthermore: decreasing transistor sizes lead to a higher rate of transient errors in CPU functional units [5]

[1] Shirvani, McCluskey: Fault-Tolerant Systems in A Space Environment: The CRC ARGOS Project, 1998
[2] Schroeder, Pinheiro, Weber: DRAM Errors in the Wild: A Large-Scale Field Study, SIGMETRICS 2009
[3] Hwang, Stefanovici, Schroeder: Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design, ASPLOS 2012
[4] Pinheiro, Weber, Barroso: Failure Trends in a Large Disk Drive Population, FAST 2007
[5] Shivakumar, Kistler, Keckler: Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic, DSN 2002

slide-5
SLIDE 5

Transparent Replication as OS Service

Figure (built up in steps): system architecture with the labels

  • Replicated Driver, Unreplicated Application, Replicated Application
  • Romain
  • L4 Runtime Environment
  • L4/Fiasco.OC microkernel
  • Reliable Computing Base [6]

[6] Döbel, Härtig, Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012

slide-10
SLIDE 10

Hardening the RCB

  • Use an FT-encoding compiler?

– Has not been done for kernel code yet
– Only protects SW components

  • RAD-hardened hardware?

– Too expensive
– Rather provide small, separate building blocks

Our proposal: Split HW into ResCores and NonRes-Cores [7]

Figure: one ResCore next to ten NonRes Cores.

[7] Döbel, Härtig: Who watches the watchmen? – Protecting Operating System Reliability Mechanisms, HotDep 2012

slide-12
SLIDE 12

Fast State Comparison in Hardware

Fingerprinting: Signature generation and checking in HW [8]

  • XOR, CRC8, CRC16, CRC32
  • Little hardware overhead: 3-9%
  • Used for basic-block signature checking

Figure: processor pipeline (IF, ID, RA, EXE, MEM, WB, exception, retire) with an Instruction FP unit and a Data FP unit; further labels: inst, result, Chunk CNT.

[8] Axer, Ernst, Döbel, Härtig: Designing an Analyzable and Resilient Embedded Operating System, SOBRES 2012
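A software sketch of the fingerprinting idea, assuming the observed stream is a list of 32-bit words. The function names and sample words are hypothetical, and the hardware unit from the slide is not modeled; the sketch only contrasts an XOR fold with a CRC32 as compaction functions.

```python
import zlib
from functools import reduce


def xor_fingerprint(words):
    """Fold a stream of 32-bit words into a single word with XOR."""
    return reduce(lambda acc, w: acc ^ (w & 0xFFFFFFFF), words, 0)


def crc32_fingerprint(words):
    """Accumulate a CRC32 over the stream, one 32-bit word at a time."""
    crc = 0
    for w in words:
        crc = zlib.crc32((w & 0xFFFFFFFF).to_bytes(4, "big"), crc)
    return crc


# Hypothetical word stream standing in for executed instructions or results.
golden = [0x9DE3BFA0, 0x82102001, 0x84004001, 0x81C7E008]
bit_flip = [golden[0], golden[1] ^ 0x4, golden[2], golden[3]]   # one flipped bit
reordered = [golden[1], golden[0], golden[2], golden[3]]        # two words swapped

flip_seen_by_xor = xor_fingerprint(golden) != xor_fingerprint(bit_flip)
swap_seen_by_xor = xor_fingerprint(golden) != xor_fingerprint(reordered)
swap_seen_by_crc = crc32_fingerprint(golden) != crc32_fingerprint(reordered)
```

Both functions catch the single bit flip, but the XOR fold cannot see the swapped words because XOR is commutative; the CRC depends on word order.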

slide-13
SLIDE 13

Basic-Block Signature (SparcV8 - LLVM)

  1. A custom LLVM pass annotates basic blocks
  2. Signature setup loads a precomputed instruction signature
  3. The fingerprint is computed during basic-block execution
  4. The signature is checked automatically on control-flow changes (e.g. jmp)
  5. A trap is raised if the signature does not match

Figure: a basic block with signature setup at entry, signature generation during execution, and signature check at exit.
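The five steps can be mimicked in software as follows. This is a minimal sketch under stated assumptions: CRC32 stands in for the hardware fingerprint, `annotate` stands in for the compiler pass, and an exception stands in for the trap. All names and instruction words are hypothetical.

```python
import zlib


class SignatureTrap(Exception):
    """Raised when the runtime fingerprint differs from the expected signature."""


def signature(instructions):
    """CRC32 over the 32-bit instruction words of one basic block."""
    crc = 0
    for word in instructions:
        crc = zlib.crc32(word.to_bytes(4, "big"), crc)
    return crc


def annotate(block):
    """Stand-in for the compiler pass: attach a precomputed signature."""
    return {"instructions": list(block), "expected": signature(block)}


def run_block(annotated, fetched):
    """Fingerprint the words actually fetched and check them at block exit."""
    if signature(fetched) != annotated["expected"]:
        raise SignatureTrap("basic-block signature mismatch")
    return True


# Hypothetical basic block and a copy with one corrupted instruction word.
block = [0x9DE3BFA0, 0x82102001, 0x84004001, 0x10800005]
annotated = annotate(block)
corrupted = [block[0], block[1] ^ 0x00000100, block[2], block[3]]
```

Calling `run_block(annotated, block)` returns normally, while `run_block(annotated, corrupted)` raises `SignatureTrap`.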

slide-14
SLIDE 14

Preliminary Results - LLVM-Assisted Signature Checking

  1. Signature checking detects control-flow errors
  2. Code-size overhead is on the order of 30%, since basic blocks in our benchmarks averaged 6.6 instructions
  3. How many control-flow errors are expected?

Error-injection experiment

Figure: share of outcomes (0% to 100%) for the bars F, D, RA, and E; outcome classes: data correct and no control flow change, data correct and control flow change, data incorrect (dc), exception (exc), not terminating (nt).
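A back-of-the-envelope reading of the 30% figure. It assumes, and the slide does not say so, that the overhead stems from a fixed number k of extra instruction words per basic block and that 6.6 is the mean block length:

```latex
\text{overhead} \approx \frac{k}{6.6},
\qquad
k = 2 \;\Rightarrow\; \frac{2}{6.6} \approx 0.30
```

Under that assumption, roughly two added words per block would account for the reported overhead.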

slide-15
SLIDE 15

Three Paths for ASTEROID v2

  1. Increased protection against hardware errors

– Not only the processor might fail!
– How do we perform real-time analysis?
– Can we analyze software’s vulnerability?

  2. Extended support for replicating software

– Shared Memory
– Multithreading
– Device I/O

  3. Integrated HW/SW platform

slide-16
SLIDE 16

Protecting Against Errors in the NoC

  • First step: harden the routers of the network-on-chip (NoC) and detect packet abnormalities
  • Problem: data corruption in the packet header (route, type, ...)
  • Protect packets with checksums (e.g. CRC)
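A minimal sketch of checksum-protected packets, assuming a made-up header layout (one byte route, one byte type, two bytes payload length) and CRC32 as the checksum. Neither the layout nor the CRC width comes from the slides.

```python
import struct
import zlib

HEADER = struct.Struct(">BBH")  # hypothetical layout: route, type, payload length


def build_packet(route, ptype, payload):
    """Serialize header and payload and append a CRC32 over both."""
    body = HEADER.pack(route, ptype, len(payload)) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def check_packet(packet):
    """Return (route, type, payload) or raise ValueError on a CRC mismatch."""
    body = packet[:-4]
    (crc,) = struct.unpack(">I", packet[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("packet CRC mismatch")
    route, ptype, length = HEADER.unpack(body[:HEADER.size])
    return route, ptype, body[HEADER.size:HEADER.size + length]


packet = build_packet(3, 1, b"data")
damaged = bytearray(packet)
damaged[0] ^= 0x01  # flip one bit in the route field
```

Because the CRC covers the header as well as the payload, the flipped route bit in `damaged` is rejected by `check_packet` instead of sending the packet to the wrong destination.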

slide-17
SLIDE 17

Real-Time Analysis

Timing effects in the hardware architecture (e.g. retransmission on buses, NoC) [9]

Figure: timeline of tasks τ1 and τ2 (labels C1, C2, E1) over time t, showing signaling overhead, re-execution, and higher-priority interference.

Figure: X+(t) on a logarithmic axis (10^-15 to 10^0) over t [ms] (20 to 180); curves F1 b, F1 c, F1 m, F9 b, F9 c, F9 m.

[9] Axer, Ernst: Stochastic Response-Time Guarantee for Non-Preemptive, Fixed-Priority Scheduling Under Error, 2013, to appear

slide-18
SLIDE 18

Real-Time Analysis

  1. Timing effects in software (e.g. replication in the Romain framework)
  2. Problem: redundant copies must wait and synchronize on state externalization
  3. Solution: response-time analysis for the Romain framework
  4. Result: the amount of externalization and the priority have a major impact on response times

Figure: WCRT [ms] over load on core 1 and load on core 2 (each 0.0 to 1.0), parallel vs. sequential; panels: Bitcount (WCRT axis 100 to 800 ms) and Rijndael (WCRT axis 200 to 1600 ms). [10]

[10] Real-Time Analysis with pyCPA – http://code.google.com/p/pycpa

slide-19
SLIDE 19

Analyzing Program Vulnerability

  • Varying application fault tolerance requirements
  • Optimize resource usage for replication
  • Analyze RCB components to know what to protect
  • ASTEROID + DanceOS + FEHLER: [11]

– Evaluate usefulness of the program vulnerability factor (PVF) vs. fault injection (FI)
– Outline challenges and possible solutions

Figure: register EDX; PVF and FI failure ratio (0 to 1) and absolute difference | PVF - FI | over time [10k instruction blocks] (0 to 350).

[11] Döbel, Schirmeier, Engel: Investigating the Limitations of PVF for Program Vulnerability Analysis, DFR 2013

slide-20
SLIDE 20

Shared Memory

  • Not in complete control of master
  • Standard technique: trap & emulate

– Execution overhead (100x to 1000x)
– Adds complexity to the RCB: disassembler 6,000 LoC, tiny emulator 500 LoC

  • Our implementation: copy & execute

slide-21
SLIDE 21

Copy&Execute

Figure (built up in steps): Master and Replica.

  • The replica reaches the instruction mov eax, [ebx], which cannot complete in the replica (marked X)
  • The master holds a prepared code sequence: load repl. state; NOP; NOP; ...; NOP; restore master state
  • The instruction mov eax, [ebx] is copied from the replica into the master
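A toy model of the copy & execute sequence, written in Python for readability. Registers are dictionaries, the only supported instruction is a load, and the NOP buffer is a list; all of this is a hypothetical stand-in for the real x86 mechanism, which the slides show only as a figure.

```python
def execute(instr, regs, memory):
    """Interpret the one instruction form this toy model knows: a load."""
    op, dst, addr_reg = instr
    if op != "load":
        raise ValueError(f"unsupported instruction: {op}")
    regs[dst] = memory[regs[addr_reg]]


def copy_and_execute(instr, master_regs, replica_regs, shared_memory):
    """Run the replica's instruction once, in the master's context."""
    saved = dict(master_regs)            # remember master state
    slot = ["nop", "nop", "nop"]         # prepared buffer of NOPs
    slot[0] = instr                      # copy the instruction into the buffer
    master_regs.clear()
    master_regs.update(replica_regs)     # load replica state
    for entry in slot:
        if entry != "nop":
            execute(entry, master_regs, shared_memory)
    replica_regs.update(master_regs)     # hand the result back to the replica
    master_regs.clear()
    master_regs.update(saved)            # restore master state


# Hypothetical state: the replica wants to read shared memory at address ebx.
shared_memory = {0x1000: 42}
replica = {"eax": 0, "ebx": 0x1000}
master = {"eax": 7, "ebx": 9}
copy_and_execute(("load", "eax", "ebx"), master, replica, shared_memory)
```

After the call, the replica's `eax` holds the value read from shared memory and the master's registers are unchanged. No disassembler or emulator is involved, since the copied instruction runs directly.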

slide-29
SLIDE 29

How About Multithreading?

Figure (built up in steps): thread sequences A (A1 to A4), B (B1 to B3), and C (C1 to C4), each shown twice.

slide-32
SLIDE 32

Problem: Nondeterminism

Figure (two steps): the thread sequences A, B, and C are again shown twice, but the two copies no longer match; in the second step, the B and C sequences appear in a different order in each copy.

slide-34
SLIDE 34

Deterministic Multithreading

  • Related work: Debugging multithreaded programs
  • Strong determinism: All accesses to shared resources happen in the same order [12]

– Consider all memory as shared and intercept accesses
– Replicating SHM is slow

  • Weak determinism: All lock acquisitions in a program happen in the same order [13]

– Intercept calls to pthread_mutex_{lock,unlock}

[12] Liu, Curtsinger, Berger: DThreads: Efficient Deterministic Multithreading, OSDI 2011
[13] Olszewski, Ansel, Amarasinghe: Kendo: Efficient Deterministic Multithreading in Software, ASPLOS 2009
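One way to picture weak determinism is record and replay of the lock acquisition order: a first replica runs with an ordinary lock and logs which thread acquired it, and a second replica is only granted the lock in that logged order. This is a simplified sketch, not the algorithm of Kendo, DThreads, or Romain. The class and function names are hypothetical, and Python threads stand in for pthreads.

```python
import threading


class RecordingLock:
    """Replica 1: an ordinary lock that logs who acquired it, in order."""

    def __init__(self):
        self._lock = threading.Lock()
        self.order = []

    def acquire(self, tid):
        self._lock.acquire()
        self.order.append(tid)

    def release(self):
        self._lock.release()


class ReplayLock:
    """Replica 2: grants the lock only in the previously recorded order."""

    def __init__(self, order):
        self._order = list(order)
        self._next = 0
        self._held = False
        self._cond = threading.Condition()

    def _my_turn(self, tid):
        return (not self._held and self._next < len(self._order)
                and self._order[self._next] == tid)

    def acquire(self, tid):
        with self._cond:
            self._cond.wait_for(lambda: self._my_turn(tid))
            self._held = True
            self._next += 1

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()


def run_replica(lock, n_threads=3, iterations=4):
    """Run worker threads that each take the lock `iterations` times."""
    log = []

    def worker(tid):
        for _ in range(iterations):
            lock.acquire(tid)
            log.append(tid)  # critical section
            lock.release()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


recorded = run_replica(RecordingLock())
replayed = run_replica(ReplayLock(recorded))
```

The recorded order may differ from run to run, but `replayed` always equals `recorded`, so both replicas see the same sequence of critical sections. Accesses outside locks remain unordered, which is the difference from strong determinism.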

slide-37
SLIDE 37

Replication and Device I/O

Special handling for hardware accesses:

  • Asynchronous events

– OS kernel design: IRQs are delivered synchronously
– Replication is easy

  • Non-idempotent memory accesses

– Reading I/O memory multiple times may return different values or trigger side effects
– Similar to the shared-memory problem
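A small model of the non-idempotent read problem, assuming a hypothetical read-to-clear status register. If every replica reads the device itself, the replicas diverge. If the access is performed once and the value is handed to all replicas, they stay consistent.

```python
class ReadToClearRegister:
    """Hypothetical device register whose value is cleared by reading it."""

    def __init__(self, value):
        self._value = value

    def read(self):
        value, self._value = self._value, 0
        return value


def naive_replicated_read(register, replicas):
    """Each replica touches the device on its own."""
    return [register.read() for _ in range(replicas)]


def read_once_and_distribute(register, replicas):
    """The access happens once; every replica receives the same value."""
    value = register.read()
    return [value] * replicas


diverged = naive_replicated_read(ReadToClearRegister(0x5), 3)
consistent = read_once_and_distribute(ReadToClearRegister(0x5), 3)
```

In the naive variant only the first replica sees the pending status bits, so a later state comparison would flag a fault that never happened.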

slide-38
SLIDE 38

Conclusion

  • ASTEROID Phase 1

– Transparent Replication as OS Service
– Hardware support to improve state comparison
– Initial Real-Time Analysis (cont. in P2)

  • ASTEROID Phase 2

– Multithreaded Replication
– Dealing with I/O and Shared Memory
– Protecting non-CPU hardware components


slide-40
SLIDE 40

Romain: Structure

Figure (built up in steps): a Master and three Replicas; a comparison symbol (=) between the replicas and the master; the master contains a System Call Proxy and a Resource Manager.
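The "=" in the structure figure suggests that replica states are compared. A minimal sketch of such a comparison with majority voting follows. Voting, the state representation (a tuple of register/value pairs), and all names are assumptions made for illustration and are not taken from the slides.

```python
from collections import Counter


def compare_replicas(states):
    """Return (majority_state, indices_of_deviating_replicas).

    Raises RuntimeError when no state is held by a strict majority.
    """
    state, votes = Counter(states).most_common(1)[0]
    if votes * 2 <= len(states):
        raise RuntimeError("no majority among replica states")
    deviating = [i for i, s in enumerate(states) if s != state]
    return state, deviating


# Hypothetical register snapshots of three replicas at a system call.
good = (("eax", 1), ("ebx", 2))
bad = (("eax", 9), ("ebx", 2))
majority, deviating = compare_replicas([good, good, bad])
```

With three replicas a single deviating replica can be identified and outvoted. With only two replicas a mismatch can be detected, but the function raises because it cannot tell which replica is wrong.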

slide-44
SLIDE 44

Resource Management: Capabilities

Figure (built up in steps): capability tables with slots 1 to 6 for Replica 1, Replica 2, and the Master.

slide-47
SLIDE 47

Partitioned Capability Tables

Figure: capability tables with slots 1 to 6 for Replica 1, Replica 2, and the Master; legend: marked used, master private.

slide-48
SLIDE 48

Replica Memory Management

Figure (built up in steps): Replica 1 and Replica 2, each with memory regions labelled rw, ro, ro, and the Master.

slide-51
SLIDE 51

Overhead vs. Unreplicated Execution

Figure: overhead of replicated vs. unreplicated execution [14]

[14] Döbel, Härtig, Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012

slide-52
SLIDE 52

Romain Lines of Code

Base code (main, logging, locking)      325
Application loader                      375
Replica manager                         628
Redundancy                              153
Memory manager                          445
System call proxy                       311
Shared memory                           281
Total                                 2,518

Fault injector                          668
GDB server stub                       1,304