Surviving Sensor Network Software Faults Presented by Jacek Migdal - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Surviving Sensor Network Software Faults

Presented by Jacek Migdal

slide-2
SLIDE 2

Software crashes

Software bugs are common:

  • tests may not reveal rare problems
  • hard to identify and fix

... but a sensor network should be able to work for years.

Ariane 5 Flight 501

slide-3
SLIDE 3

Common approach

Have you tried rebooting?

slide-4
SLIDE 4

Rebooting on failure

  • works in most cases (memory faults)
  • recent data is lost
  • time consuming => reduces availability
  • causes additional cost for routing protocols

slide-5
SLIDE 5

Proposed solution: Neutron

Divide software into recovery units and reboot only the faulty unit.

slide-6
SLIDE 6

Hardware

  • 1-8 MHz
  • 4-10 kB SRAM
  • 40-128 KB flash memory
  • without hardware memory isolation

A low-overhead solution is needed.

slide-7
SLIDE 7

Architecture

Diagram components: Safe TinyOS, TOSThreads, Neutron extensions, Deputy compiler, compiler, Neutron recovery code, TinyOS.

slide-8
SLIDE 8

Recovery unit

Definition:

  • an application recovery unit may not call directly into a different recovery unit
  • a recovery unit instantiates at least one thread (the kernel has exactly one)
  • every component belongs to at most one application recovery unit or to the kernel recovery unit

Recovery unit:

  • application
  • kernel
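
The rules in the definition above can be written as simple checks. This is only an illustrative sketch: the unit ids, the `recovery_unit_t` type, and both function names are hypothetical and are not part of Neutron or TinyOS. The call check covers only calls that originate in an application unit, since that is the only case the slide states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KERNEL_UNIT 0 /* hypothetical id; application units use ids >= 1 */

/* Hypothetical description of one recovery unit. */
typedef struct {
    int id;              /* KERNEL_UNIT or an application unit id */
    size_t thread_count; /* threads instantiated by this unit */
} recovery_unit_t;

/* Thread rule: the kernel has exactly one thread,
 * an application unit has at least one. */
bool unit_thread_rule_ok(const recovery_unit_t *u)
{
    assert(u != NULL);
    if (u->id == KERNEL_UNIT) {
        return u->thread_count == 1;
    }
    return u->thread_count >= 1;
}

/* Call rule for application units: a direct call is only allowed when
 * caller and callee are in the same unit. Reaching any other unit
 * (including the kernel) must go through a system call instead. */
bool app_direct_call_ok(int caller_app_unit, int callee_unit)
{
    assert(caller_app_unit != KERNEL_UNIT);
    return caller_app_unit == callee_unit;
}
```

In Neutron itself these constraints are enforced at compile time, not by runtime checks like these.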
slide-9
SLIDE 9

How to divide a program into recovery units

  • Use annotations to define kernel boundaries (@syscall_base, @syscall_ext)
  • Use the Deputy compiler to divide the program into recovery units and isolate them

slide-10
SLIDE 10

How to recover application unit

  1. Cancel system calls and halt threads (pending flag)
  2. Reclaim allocated memory
  3. Re-initialize application unit RAM
  4. Restart the application unit thread
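
The four application-unit recovery steps can be sketched on a toy model. Everything here is an assumption made for illustration: the `app_unit_t` layout, the sizes, and the function name are hypothetical and do not reflect Neutron's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_THREADS 2
#define UNIT_RAM_BYTES 8

/* Hypothetical model of one application recovery unit. */
typedef struct {
    bool syscall_pending[MAX_THREADS];      /* "pending" flag per thread */
    bool thread_running[MAX_THREADS];
    size_t heap_bytes_in_use;               /* memory allocated by the unit */
    unsigned char ram[UNIT_RAM_BYTES];      /* the unit's live variables */
    unsigned char ram_init[UNIT_RAM_BYTES]; /* their initial values */
    unsigned restarts;
} app_unit_t;

void app_unit_recover(app_unit_t *u)
{
    assert(u != NULL);

    /* 1. Cancel system calls and halt threads. */
    for (size_t i = 0; i < MAX_THREADS; i++) {
        u->syscall_pending[i] = false;
        u->thread_running[i] = false;
    }

    /* 2. Reclaim allocated memory. */
    u->heap_bytes_in_use = 0;

    /* 3. Re-initialize the unit's RAM from its initial image. */
    memcpy(u->ram, u->ram_init, UNIT_RAM_BYTES);

    /* 4. Restart the unit's threads. */
    for (size_t i = 0; i < MAX_THREADS; i++) {
        u->thread_running[i] = true;
    }
    u->restarts++;
}
```

Other recovery units and the kernel are never touched by this routine, which is the point of rebooting only the faulty unit.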

slide-11
SLIDE 11

How to recover kernel unit

  1. Cancel outstanding system calls
  2. Save application-dependent state
  3. Reboot TinyOS (skip thread state initialization)
  4. Restore the saved state
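
The kernel recovery steps above can be illustrated with a similar toy model. The `node_t` layout and the `kernel_boot` and `kernel_unit_recover` names are hypothetical. The sketch also assumes that the final step puts the saved application-dependent state back in place after the reboot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NUM_APP_THREADS 2
#define KERNEL_WORDS 4

/* Hypothetical model of a whole node. */
typedef struct {
    int kernel_ram[KERNEL_WORDS];              /* kernel variables */
    int thread_state[NUM_APP_THREADS];         /* application thread state */
    bool syscall_outstanding[NUM_APP_THREADS];
    int app_dependent[NUM_APP_THREADS];        /* kernel state apps rely on */
} node_t;

/* Boot the kernel. A normal boot also initializes thread state;
 * a recovery boot skips that so application threads survive. */
void kernel_boot(node_t *n, bool init_thread_state)
{
    memset(n->kernel_ram, 0, sizeof n->kernel_ram);
    memset(n->app_dependent, 0, sizeof n->app_dependent);
    if (init_thread_state) {
        memset(n->thread_state, 0, sizeof n->thread_state);
    }
}

void kernel_unit_recover(node_t *n)
{
    int saved[NUM_APP_THREADS];
    assert(n != NULL);

    /* 1. Cancel outstanding system calls. */
    for (size_t i = 0; i < NUM_APP_THREADS; i++) {
        n->syscall_outstanding[i] = false;
    }

    /* 2. Save application-dependent state. */
    memcpy(saved, n->app_dependent, sizeof saved);

    /* 3. Reboot the kernel, skipping thread state initialization. */
    kernel_boot(n, false);

    /* 4. Put the saved state back. */
    memcpy(n->app_dependent, saved, sizeof saved);
}
```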

slide-12
SLIDE 12

Precious state

  • Losing application state is too costly.
  • Maintain variable values across application unit restarts (mark them with the @precious flag).

slide-13
SLIDE 13

Precious state

Recovery:

  1. Check for corruption
  2. Push to stack
  3. Re-initialize recovery unit
  4. Pop from stack and copy

Features:

  1. Groups
  2. Atomic operations
  3. (Optional) Check integrity on application level

slide-14
SLIDE 14

Evaluation - availability

slide-15
SLIDE 15

Evaluation - routing protocol cost

slide-16
SLIDE 16

Evaluation - overhead

Low programmer overhead (mostly cost of adding annotations)

slide-17
SLIDE 17

Related work

  • kernel-level safety (most OSes, using virtual address spaces)
  • language-level safety
  • micro-reboots (Java Enterprise Edition)
slide-18
SLIDE 18

Conclusion

Neutron:

  • recovers from memory safety bugs
  • divides the program into recovery units
  • re-initializes the faulty unit on error
  • is implemented as part of the compiler and TinyOS
  • is designed for limited architectures
  • reduces time to synchronization by 94% and the cost of the routing protocol by 99.5%

slide-19
SLIDE 19

References

  • Y. Chen, O. Gnawali, M. Kazandjieva, P. Levis, J. Regehr: “Surviving Sensor Network Software Faults,” in Proceedings ACM SOSP 2009, Big Sky, MT, USA, October 2009.

  • Image sources:
  • http://top10latest.com/top-10-costliest-software-bugs
  • http://www.personal.kent.edu/~rmuhamma
  • http://store.fungizmos.com
  • http://omrumfuneraltransport.com
  • http://www.moddergamer.com