SLIDE 1
Ashis Tarafdar Vijay K. Garg
ashis@cs.utexas.edu garg@ece.utexas.edu
Parallel and Distributed Systems Laboratory
Department of Electrical and Computer Engineering
University of Texas at Austin
Austin, USA 78712
http://maple.ece.utexas.edu
Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution
SLIDE 2 Introduction
Software Fault Tolerance:
to ensure that the system continues normal operation despite the presence of software faults
software faults cause software failures
SLIDE 3
Goals
A new approach to software fault tolerance
The predicate control problem: introduction and results
SLIDE 4
Background: Software Fault Tolerance
The Progressive Retry Approach: [Wang et al, 1997]
software failures are often transient
rollback and re-execute
no guarantees
SLIDE 5 Background: Races in Concurrent Programs
What is a race?
A race occurs when two processes can concurrently access the same shared resource.
[Figure: two panels, "A race in a concurrent computation" and "A race-free computation". Each shows processes P1 and P2 with critical sections cs1 and cs2 and events a, b. Legend: critical section, synchronization.]
Races are an important class of software faults. [Iyer & Lee, 95]
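The race definition above can be illustrated with vector clocks: two critical sections race when neither one's exit happened before the other's entry. This is a minimal sketch, not the authors' detector. The names `happened_before` and `is_race` and the (entry clock, exit clock) representation of a critical section are assumptions made for illustration.

```python
from typing import Tuple

Clock = Tuple[int, ...]            # one vector-clock entry per process
CriticalSection = Tuple[Clock, Clock]  # (entry clock, exit clock), hypothetical layout


def happened_before(a: Clock, b: Clock) -> bool:
    """True if the event stamped a happened before the event stamped b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def is_race(cs1: CriticalSection, cs2: CriticalSection) -> bool:
    """Two critical sections race if neither is ordered entirely before the other."""
    (entry1, exit1), (entry2, exit2) = cs1, cs2
    return not happened_before(exit1, entry2) and not happened_before(exit2, entry1)


# P1 and P2 enter their critical sections with no communication: a race.
unsynchronized = is_race(((1, 0), (2, 0)), ((0, 1), (0, 2)))

# P2 enters only after receiving a message sent at P1's exit: race-free.
synchronized = is_race(((1, 0), (2, 0)), ((2, 1), (2, 2)))
```

Here `unsynchronized` is true and `synchronized` is false, matching the two cases of a race and a race-free computation.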
SLIDE 6 The Controlled Re-execution Approach
1. Tracing an execution
2. Detecting a race failure
3. Determining a control strategy
4. Re-executing under control
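[Figure: two panels, "Traced Computation" and "Controlling Computation", each over processes P1, P2, P3 with critical sections cs1 to cs4. The controlling computation shows the added synchronization, with events a, b, c, d.]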
SLIDE 7
Model
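computation (happened before)
states
global state
consistent global state
global predicate (e.g. mutual exclusion)
[Figure: a computation over processes P1, P2, P3 with critical sections cs1 to cs4 and events a to f. Two global states, G and H, are marked, with the labels "inconsistent" and "consistent".]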
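The notion of a consistent global state can be made concrete with vector clocks. The sketch below assumes a global state is given as one vector clock per process (the clock of the last event that process has executed). The function name `is_consistent` and this representation are illustrative assumptions, not part of the slides.

```python
from typing import Sequence, Tuple

Clock = Tuple[int, ...]


def is_consistent(cut: Sequence[Clock]) -> bool:
    """cut[i] is the vector clock of the last event of process i in the global state.

    The state is consistent if no process has seen more events of process i
    than process i itself has executed.
    """
    n = len(cut)
    return all(cut[j][i] <= cut[i][i] for i in range(n) for j in range(n))


# Process 0 sends a message at its first event; process 1 receives it
# at its second event, so that receive carries the clock (1, 2).
consistent_state = is_consistent([(1, 0), (1, 2)])    # send is included
inconsistent_state = is_consistent([(0, 0), (1, 2)])  # receive without its send
```

The second state includes a receive without the matching send, so it cannot occur in any run and the check rejects it.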
SLIDE 8 The Off-line Predicate Control Problem
Problem Statement:
Given a computation C and a global predicate B, find a controlling computation of B in C
[Figure: two panels, "Computation C" and "Controlling Computation C'", for B = mutual exclusion. Both show processes P1, P2, P3 with critical sections cs1 to cs4 and a global state G; C' carries the additional labels a, b, c, d.]
Note: A controlling computation must have no cycles!
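The no-cycles requirement can be checked mechanically. The sketch below assumes a computation is given as a list of happened-before edges between named events, and that a candidate control strategy is a list of extra synchronization edges. The names `is_acyclic` and `is_valid_control` and the event labels are hypothetical. The cycle check is Kahn's topological-sort algorithm.

```python
from collections import defaultdict, deque
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def is_acyclic(edges: Iterable[Edge]) -> bool:
    """Kahn's algorithm: the graph is acyclic iff every node can be removed
    in topological order."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    queue = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return removed == len(nodes)


def is_valid_control(computation: Iterable[Edge], added_sync: Iterable[Edge]) -> bool:
    """Added synchronization is usable only if the combined order has no cycle."""
    return is_acyclic(list(computation) + list(added_sync))


# Hypothetical two-process computation: each process runs one critical section.
computation = [("cs1.start", "cs1.end"), ("cs2.start", "cs2.end")]
ok = is_valid_control(computation, [("cs1.end", "cs2.start")])
deadlock = is_valid_control(
    computation, [("cs1.end", "cs2.start"), ("cs2.end", "cs1.start")]
)
```

Ordering cs1 before cs2 is fine, while forcing each critical section to wait for the other creates a cycle that no re-execution could satisfy.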
SLIDE 9
Off-line Mutual Exclusion
Theorem: The off-line predicate control problem is NP-Hard [Tarafdar & Garg, 98]
Variants of Off-line Mutual Exclusion
Off-line Readers Writers
Off-line Independent Read-Write Mutual Exclusion
Off-line Independent Mutual Exclusion
SLIDE 10 A Relation on Critical Sections
cs1 → cs2 iff cs1 starts before cs2 finishes
[Figure: three example computations illustrating the relation between cs1 and cs2, over processes P1, P2 and, in two of them, P3, with events labelled a to f.]
SLIDE 11 Off-line Readers Writers: Result
Theorem: For a computation C and a global predicate B_rw, a controlling computation of B_rw in C exists iff all cycles in the relation on critical sections contain only read critical sections
Proof: Key Ideas (Necessary and Sufficient directions):
[Figure: critical sections cs1, cs2, cs3 on processes P1, P2, P3, each marked R (read) or W (write). Labels: strongly connected components, write critical section.]
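The condition in the theorem can be tested directly on small examples. The sketch below assumes the relation on critical sections is given as a list of directed edges with no self-loops, and that each critical section is tagged "R" or "W". It uses plain graph reachability rather than strongly connected components, so it is a naive check and not the O(np) algorithm from the talk. The names `reachable_from` and `control_exists` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def reachable_from(start: Hashable, succ: Dict[Hashable, List[Hashable]]) -> Set[Hashable]:
    """Nodes reachable from start by following one or more edges."""
    seen: Set[Hashable] = set()
    stack = list(succ.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return seen


def control_exists(kinds: Dict[Hashable, str], edges: Iterable[Edge]) -> bool:
    """True iff every cycle contains only read critical sections,
    i.e. no write critical section can reach itself."""
    succ: Dict[Hashable, List[Hashable]] = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    return all(
        cs not in reachable_from(cs, succ)
        for cs, kind in kinds.items()
        if kind == "W"
    )


# Hypothetical example: two readers in a cycle, one writer outside it.
kinds = {"cs1": "R", "cs2": "R", "cs3": "W"}
read_only_cycle = control_exists(kinds, [("cs1", "cs2"), ("cs2", "cs1"), ("cs2", "cs3")])
# Closing the loop through the writer puts a write critical section on a cycle.
write_on_cycle = control_exists(
    kinds, [("cs1", "cs2"), ("cs2", "cs1"), ("cs2", "cs3"), ("cs3", "cs1")]
)
```

In the first case the only cycle is between readers, so a controlling computation exists. In the second the writer lies on a cycle, so none does.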
SLIDE 12 Off-line Readers Writers: Algorithm
n : number of processes
p : number of critical sections in computation
Algorithm 1: O(p^2)
Algorithm 2: O(n^2 p)
Algorithm 3: O(np)
Key Idea: Only "new" CS's need be considered
Key Idea: An SCC contains at most one CS per process
[Figure: processes P1 to P4 with critical sections cs1 to cs8; two regions are labelled A and B.]
SLIDE 13
Summary
A new approach to software fault tolerance
introduced the controlled re-execution approach for race faults
focussed on the problem of determining a control strategy
The off-line predicate control problem: introduction and results
defined the off-line predicate control problem
necessary and sufficient conditions for the off-line readers writers problem
O(np) algorithm for the off-line readers writers problem
also: other variants of off-line mutual exclusion
SLIDE 14
On-line Mutual Exclusion is Impossible
[Figure: three computations over processes P1 and P2 with critical sections cs1 and cs2. Global states G and H are marked; events are labelled a, b, c, d.]