Distributed Real-Time Fault Tolerance on a Virtualized Multi-Core System

  1. Distributed Real-Time Fault Tolerance on a Virtualized Multi-Core System
     Eric Missimer*, Richard West and Ye Li
     Computer Science Department, Boston University, Boston, MA 02215
     Email: {missimer,richwest,liye}@cs.bu.edu
     *VMware, Inc.
     Eric Missimer, Richard West and Ye Li Real-Time Fault Tolerance

  2. Quest-V: Virtualized Multi-Core System
     Quest-V background:
     - Boston University’s in-house operating system + hypervisor
     - Developed for real-time and high-confidence systems
     Key features:
     - Virtualized separation kernel
     - Simplified hypervisor: sandboxes are pinned to cores at boot, so no scheduling is needed
     - I/O devices are partitioned amongst sandboxes, not shared or emulated
     - Virtualization is used for encapsulation
     - The hypervisor is assumed to be a trusted code base
     - Communication through explicit shared memory channels

  3. Quest-V Design

  4. Motivation
     - Safety-critical systems require component isolation and redundancy
       - Integrated Modular Avionics (IMA), automobiles
     - Multi-/many-core processors are increasingly popular in embedded systems
     - Multi-core processors can be used to consolidate redundant services onto a single platform

  6. Motivation
     - Many processors now feature hardware virtualization
       - ARM Cortex A15, Intel VT-x, AMD-V
     - Hardware virtualization provides an opportunity to efficiently partition resources amongst guest VMs
     - Not trying to remove all hardware redundancy, just lessen it
     - H/W Virtualization + Resource Partitioning/Isolation = Platform for Embedded Safety-Critical Systems

  7. Motivation
     - Focusing on hardware transient faults and software timing faults
       - Random bit flips caused by radiation
       - Asynchronous bugs in faulty device drivers

  8. Quest-V N-Modular Redundancy
     - N redundant copies of a program, one per sandbox (at least three)
     - At least one voter
     - Hash-based fault detection and recovery
     - Virtualized separation kernel platform provides new n-modular redundancy configurations
     - Software-based dual-core lockstep (DCLS)
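To make the voter's role concrete, here is a minimal sketch of majority voting over the results reported by N replicas. It is an illustration only, not the Quest-V voter: the function name `majority_vote` and the choice to represent each replica's result as a `bytes` value (for example, a hash) are assumptions made for this example.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def majority_vote(results: Sequence[bytes]) -> Tuple[Optional[bytes], List[int]]:
    """Return (majority value, indices of dissenting replicas).

    The majority value is None when no value is held by a strict
    majority of replicas; in that case every replica is reported.
    """
    if not results:
        return None, []
    # most_common(1) yields the value with the highest count.
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No strict majority: the voter cannot decide who is correct.
        return None, list(range(len(results)))
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty
```

With three replicas, a single faulty replica is outvoted and identified by index. With only two replicas a disagreement can be detected but not attributed, which is consistent with the slide's "at least three" requirement.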

  9. N-Modular Redundancy

  10. N-Modular Redundancy for Real-Time Applications

  11. Fault Detection
      - Typical n-modular redundancy compares the output of the computation
        - Pro: fast
        - Con: don’t know what went wrong
      - Proposed detection method: compare application memory on a per-page basis via hashes
        - Pro: faster and generic recovery for complicated applications (discussed later)
        - Con: must hash the memory state of the process (slow)
        - Can speed up comparison using a “summary” hash
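The per-page comparison with a summary hash can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 4096-byte page size, the use of SHA-256, and the helper names `page_hashes`, `summary_hash` and `differing_pages` are all choices made for this example.

```python
import hashlib
from typing import List

PAGE_SIZE = 4096  # assumed page size for this sketch


def page_hashes(memory: bytes, page_size: int = PAGE_SIZE) -> List[bytes]:
    """Hash each page of a replica's memory image separately."""
    return [
        hashlib.sha256(memory[off:off + page_size]).digest()
        for off in range(0, len(memory), page_size)
    ]


def summary_hash(hashes: List[bytes]) -> bytes:
    """Hash the concatenated page hashes into one short digest."""
    return hashlib.sha256(b"".join(hashes)).digest()


def differing_pages(a: List[bytes], b: List[bytes]) -> List[int]:
    """Return indices of pages whose hashes differ between two replicas.

    The summary hashes are compared first, so the common fault-free
    case needs a single comparison instead of one per page.
    """
    if len(a) != len(b):
        raise ValueError("replicas must cover the same number of pages")
    if summary_hash(a) == summary_hash(b):
        return []
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

Flipping one bit in the second page of a three-page image makes `differing_pages` report only page index 1, which is what lets recovery target the damaged page instead of the whole process image.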

  12. Fault Detection

  13. N-Modular Redundancy Configurations
      - Voting mechanism and device driver in the hypervisor
      - Voting mechanism and device driver in one sandbox
      - Voting mechanism distributed across sandboxes and device driver is shared

  14. Voting Mechanism and Device Driver in the Hypervisor

  15. Voting Mechanism and Device Driver in the Hypervisor
      Pros:
      - No need to modify the operating system (could apply to Linux as well as Quest)
      - Need only n sandboxes
      Cons:
      - Conflicts with Quest-V hypervisor design
      - Faulty device driver could jeopardize the entire system
      - Need to duplicate the entire guest

  16. Voting Mechanism and Device Driver in One Sandbox

  17. Voting Mechanism and Device Driver in One Sandbox
      Pros:
      - Simpler hypervisor
      - Application-level redundancy, don’t need to copy the entire sandbox
      Cons:
      - Need (n+1) sandboxes
      - Need to modify guest

  18. Voting is Distributed and Device Driver is Shared

  19. Voting is Distributed and Device Driver is Shared
      Pros:
      - Need only n sandboxes
      - Application-level redundancy, don’t need to copy the entire sandbox
      Cons:
      - Need to modify guest
      - Complicated shared device driver

  20. Recovery
      - Want recovery to be as generic as possible
      - Simple applications: rebooting might be sufficient
      - Complicated applications: rebooting could cause important state to be lost
      - Perform live migrations of either application or guest machine
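One way to picture state-preserving recovery is to repair only the pages on which a replica disagrees with the majority, copying them from a healthy replica. The sketch below is a simplified stand-in for the live migration the slides describe, not the authors' mechanism. The in-memory `bytearray` replicas, the 4096-byte page size, SHA-256, and the name `repair_replicas` are all assumptions made for this example.

```python
import hashlib
from collections import Counter
from typing import List, Tuple

PAGE_SIZE = 4096  # assumed page size for this sketch


def repair_replicas(replicas: List[bytearray],
                    page_size: int = PAGE_SIZE) -> List[Tuple[int, int]]:
    """Overwrite pages that disagree with the per-page majority.

    Returns a list of (replica index, page index) pairs that were
    repaired. Raises RuntimeError if some page has no strict majority.
    """
    size = len(replicas[0])
    if any(len(r) != size for r in replicas):
        raise ValueError("replicas must have equal size")
    repaired = []
    for off in range(0, size, page_size):
        # Hash this page in every replica and find the majority digest.
        digests = [hashlib.sha256(r[off:off + page_size]).digest()
                   for r in replicas]
        winner, count = Counter(digests).most_common(1)[0]
        if count * 2 <= len(replicas):
            raise RuntimeError(f"no majority for page {off // page_size}")
        source = replicas[digests.index(winner)]
        for i, digest in enumerate(digests):
            if digest != winner:
                # Copy the healthy page over the faulty one in place.
                replicas[i][off:off + page_size] = source[off:off + page_size]
                repaired.append((i, off // page_size))
    return repaired
```

Because only mismatched pages are copied, the recovering replica keeps the rest of its state, which is the property that a reboot would lose for complicated applications.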

  21. Recovery
      - All performed within the context of the thread’s sporadic server

  22. Quick Summary - Key Points to Take Away
      - Per-page hash-based fault detection and recovery
      - Three n-modular redundancy configurations in a virtualized separation kernel
        - Hypervisor voting
        - Sandbox voting
        - Distributed voting

  25. Conclusion
      So what’s left?
      - Further implementation and comparison
      - Figure out a solution for the voter as a single point of failure
        - Possibilities include arithmetic encoding and memory scrubbing
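As background on the arithmetic encoding option, the sketch below shows a basic AN code, in which a value x is stored as A·x and any stored word that is not a multiple of A is treated as corrupted. The slides do not say how such encoding would be applied to the voter, so this is a generic illustration. The constant `A = 58659` and the function names are assumptions chosen for the example.

```python
A = 58659  # illustrative odd constant; an assumption for this sketch


def encode(x: int) -> int:
    """Encode x as a multiple of A."""
    return A * x


def is_valid(code: int) -> bool:
    """A code word is valid only if it is still a multiple of A."""
    return code % A == 0


def decode(code: int) -> int:
    """Recover x, refusing to decode a corrupted code word."""
    if not is_valid(code):
        raise ValueError("corrupted code word")
    return code // A


def encoded_add(c1: int, c2: int) -> int:
    """Add two encoded values: A*x + A*y == A*(x + y)."""
    return c1 + c2
```

Because A is odd and greater than one, a single bit flip changes a code word by a power of two, which is never a multiple of A, so every single-bit flip in a valid code word is detected by `is_valid`.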

  27. Conclusion
      More info: www.questos.org
      Questions?
