Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution



SLIDE 1

Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution

Ashis Tarafdar (ashis@cs.utexas.edu)
Vijay K. Garg (garg@ece.utexas.edu)

Parallel and Distributed Systems Laboratory
Department of Electrical and Computer Engineering
University of Texas at Austin
Austin, USA 78712
http://maple.ece.utexas.edu

SLIDE 2

Introduction

Software Fault Tolerance:

to ensure that the system continues normal operation despite the presence of software faults (bugs)

software faults cause software failures

SLIDE 3

Goals

  • A new approach to software fault tolerance
  • The predicate control problem: introduction and results

SLIDE 4

Background: Software Fault Tolerance

The Progressive Retry Approach: [Wang et al, 1997]

  • software failures are often transient
  • rollback and re-execute
  • no guarantees

SLIDE 5

Background: Races in Concurrent Programs

What is a race?

A race occurs when two processes can concurrently access the same shared resource.

[Figure: two computations on processes P1 and P2 with critical sections cs1 and cs2. One panel: a race in a concurrent computation. Other panel: a race-free computation. Legend: critical section, synchronization.]

Races are an important class of software faults. [Iyer & Lee, 95]
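The definition of a race can be illustrated with a minimal Python sketch. It assumes, beyond what the slide states, that each critical section was traced with a vector clock at entry and at exit, and that two critical sections on different processes race when neither exits before the other enters under happened-before. The names `happened_before`, `races`, `racy` and `race_free` are hypothetical.

```python
from itertools import combinations

def happened_before(u, v):
    """True if vector clock u is componentwise <= v and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def races(critical_sections):
    """Return name pairs of critical sections on different processes
    that are not ordered by happened-before (assumed race criterion)."""
    found = []
    for (n1, p1, s1, e1), (n2, p2, s2, e2) in combinations(critical_sections, 2):
        ordered = happened_before(e1, s2) or happened_before(e2, s1)
        if p1 != p2 and not ordered:
            found.append((n1, n2))
    return found

# Each entry: (name, process index, clock at entry, clock at exit).
# Hypothetical trace with no synchronization between the two processes.
racy = [("cs1", 0, (1, 0), (2, 0)), ("cs2", 1, (0, 1), (0, 2))]

# Hypothetical trace where P1 sends a message after leaving cs1 and
# P2 receives it before entering cs2.
race_free = [("cs1", 0, (1, 0), (2, 0)), ("cs2", 1, (3, 2), (3, 3))]

print(races(racy))       # the two critical sections are unordered
print(races(race_free))  # the message orders cs1 before cs2
```

In the second trace the synchronization message plays the role of the added arrow in the race-free computation.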

SLIDE 6

The Controlled Re-execution Approach

  1. Tracing an execution
  2. Detecting a race failure
  3. Determining a control strategy
  4. Re-executing under control

[Figure: a traced computation on processes P1, P2, P3 with critical sections cs1 to cs4, and the corresponding controlling computation with added synchronization a, b, c, d.]
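Step 4, re-executing under control, might look roughly like the following Python sketch. It is an illustration under assumptions, not the authors' implementation: a control strategy is taken to be a list of pairs `(a, b)` meaning "critical section `a` must exit before `b` enters", each process is modeled as a thread, and the placement of cs1 to cs4 on P1 to P3 is hypothetical. The strategy must be acyclic, otherwise the threads would wait on each other forever.

```python
import threading

def reexecute(processes, control_edges):
    """Re-run each process's critical sections in its own thread, delaying
    section b until section a has exited for every (a, b) in control_edges.
    Returns the log of enter/exit events in the order they occurred."""
    done = {cs: threading.Event() for proc in processes.values() for cs in proc}
    log, log_lock = [], threading.Lock()

    def run(sections):
        for cs in sections:
            for a, b in control_edges:
                if b == cs:
                    done[a].wait()  # added synchronization
            with log_lock:
                log.append(("enter", cs))
            with log_lock:
                log.append(("exit", cs))
            done[cs].set()

    threads = [threading.Thread(target=run, args=(s,)) for s in processes.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

# Hypothetical layout and control strategy (not taken from the figure).
processes = {"P1": ["cs1", "cs2"], "P2": ["cs3"], "P3": ["cs4"]}
control_edges = [("cs1", "cs3"), ("cs3", "cs4"), ("cs4", "cs2")]
log = reexecute(processes, control_edges)
print(log)
```

Because this strategy totally orders the four critical sections, no two of them can overlap in the re-execution.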

SLIDE 7

Model

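  • states
  • computation (happened before)
  • global state
  • consistent global state
  • global predicate (e.g. mutual exclusion)

[Figure: a computation on processes P1, P2, P3 with critical sections cs1 to cs4 and events a to f; two global states G and H illustrate inconsistent and consistent global states.]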
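The notion of a consistent global state can be made concrete with a small Python sketch. It assumes events carry vector clocks and represents a global state as a cut, i.e. the number of events included from each process. The cut is consistent when every included event has all of its happened-before predecessors included as well. The function name and the two-process trace are hypothetical.

```python
def is_consistent(cut, clocks):
    """cut[i] is the number of events of process i included in the global state.
    clocks[i][k] is the vector clock of the (k+1)-th event of process i.
    The cut is consistent if the last included event of every process
    depends only on events that are also inside the cut."""
    for i, count in enumerate(cut):
        if count == 0:
            continue
        last = clocks[i][count - 1]
        if any(last[j] > cut[j] for j in range(len(cut))):
            return False
    return True

# Hypothetical trace: the first event of process 0 sends a message,
# and the first event of process 1 receives it.
clocks = [
    [(1, 0), (2, 0)],  # process 0
    [(1, 1), (1, 2)],  # process 1
]

print(is_consistent((1, 1), clocks))  # send and receive both included
print(is_consistent((0, 1), clocks))  # receive included without its send
```

A global predicate such as mutual exclusion is then evaluated only over cuts for which this check succeeds.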

SLIDE 8

The Off-line Predicate Control Problem

Problem Statement:

Given a computation C and a global predicate B, find a controlling computation C' of B in C.

[Figure: computation C on processes P1, P2, P3 with critical sections cs1 to cs4 and a marked global state G; controlling computation C' of B in C, with added synchronization a, b, c, d. Here B = mutual exclusion.]

Note: A controlling computation must have no cycles!
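The no-cycles requirement can be checked mechanically. The Python sketch below assumes the computation is given as happened-before edges between events and that a candidate control strategy is a set of extra edges; it uses the standard-library `graphlib.TopologicalSorter`, which raises `CycleError` when no topological order exists. The event names are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(order_edges, added_edges):
    """Return True if the computation's edges plus the added
    synchronization edges still form a directed acyclic graph."""
    preds = {}
    for before, after in list(order_edges) + list(added_edges):
        preds.setdefault(after, set()).add(before)
        preds.setdefault(before, set())
    try:
        list(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False

# Hypothetical computation: each process enters and exits one critical section.
computation = [("cs1.enter", "cs1.exit"), ("cs2.enter", "cs2.exit")]

# Forcing cs1 to finish before cs2 starts is a valid addition.
print(is_acyclic(computation, [("cs1.exit", "cs2.enter")]))

# Forcing each section to finish before the other starts is contradictory.
print(is_acyclic(computation, [("cs1.exit", "cs2.enter"),
                               ("cs2.exit", "cs1.enter")]))
```

A cyclic result corresponds to a re-execution that would deadlock, which is why such a set of added edges does not count as a controlling computation.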

SLIDE 9

Off-line Mutual Exclusion

Theorem: The off-line predicate control problem is NP-Hard [Tarafdar & Garg, 98]

Off-line Mutual Exclusion

Variants of Off-line Mutual Exclusion

  • Off-line Readers Writers
  • Off-line Independent Read-Write Mutual Exclusion
  • Off-line Independent Mutual Exclusion

SLIDE 10

A Relation on Critical Sections

cs1 is related to cs2 iff cs1 starts before cs2 finishes

[Figure: three example computations with critical sections cs1 and cs2 on processes P1, P2 (and P3), with events a to f, illustrating the relation.]
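As a rough illustration, the relation can be computed from a trace in Python. The sketch assumes that "before" means happened-before and that each critical section is recorded with vector clocks at entry and exit; both are assumptions, and the function names and traces are hypothetical.

```python
def happened_before(u, v):
    """True if vector clock u is componentwise <= v and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def relation(sections):
    """sections maps a name to (entry clock, exit clock).
    Returns the pairs (a, b), a != b, where a starts before b finishes."""
    return {
        (a, b)
        for a, (start_a, _) in sections.items()
        for b, (_, end_b) in sections.items()
        if a != b and happened_before(start_a, end_b)
    }

# Hypothetical trace 1: cs1 on P1 completes, then a message lets cs2 start on P2.
ordered = {"cs1": ((1, 0), (2, 0)), "cs2": ((3, 2), (3, 3))}

# Hypothetical trace 2: each process sends a message from inside its critical
# section and receives the other's message before leaving it.
entangled = {"cs1": ((1, 0), (4, 2)), "cs2": ((0, 1), (2, 4))}

print(relation(ordered))    # one pair: cs1 is related to cs2
print(relation(entangled))  # both directions: a cycle between cs1 and cs2
```

In the second trace each section starts before the other finishes, so no added synchronization can separate them; cycles in this relation are what the results on the following slides are stated in terms of.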

SLIDE 11

Off-line Readers Writers: Result

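Theorem: For a computation C and a global predicate Brw, a controlling computation of Brw in C exists iff all cycles in the relation on critical sections contain only read critical sections.

Proof: Key Ideas (necessary and sufficient directions):

[Figure: critical sections cs1, cs2, cs3 on processes P1, P2, P3, grouped into strongly connected components of the relation; critical sections are labeled R (read) or W (write), with the write critical section marked.]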
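The condition in the theorem can be tested directly once the relation on critical sections is available as a set of edges. The Python sketch below uses plain graph reachability rather than the strongly-connected-component algorithms from the talk, so it does not match their complexity bounds. It relies on the observation that all cycles contain only reads exactly when no write critical section lies on a cycle. The names and the example graphs are hypothetical.

```python
def reachable_from(start, succ):
    """Nodes reachable from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def controllable(edges, writes):
    """True if no write critical section lies on a cycle of the relation,
    i.e. every cycle contains only read critical sections."""
    succ = {}
    for a, b in edges:
        if a != b:  # ignore trivial self-loops
            succ.setdefault(a, set()).add(b)
    return all(w not in reachable_from(w, succ) for w in writes)

# Hypothetical relation: two reads form a cycle, the write is outside it.
reads_only_cycle = [("r1", "r2"), ("r2", "r1"), ("r2", "w1")]

# Hypothetical relation: the extra edge closes a cycle through the write.
cycle_with_write = reads_only_cycle + [("w1", "r1")]

print(controllable(reads_only_cycle, {"w1"}))
print(controllable(cycle_with_write, {"w1"}))
```

Overlapping reads are harmless under the readers-writers predicate, which is why a cycle made only of read critical sections does not rule out a controlling computation.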

SLIDE 12

Off-line Readers Writers: Algorithm

  • Algorithm 1: O(p²)
  • Algorithm 2: O(n²p)
  • Algorithm 3: O(np)

n : number of processes
p : number of critical sections in computation

Key Idea: An SCC contains at most one CS per process
Key Idea: Only "new" CS's need be considered

[Figure: processes P1 to P4 with critical sections cs1 to cs8 and two marked regions A and B.]

SLIDE 13

Summary

A new approach to software fault tolerance

  • introduced the controlled re-execution approach for race faults
  • focused on the problem of determining a control strategy

The off-line predicate control problem: introduction and results

  • defined the off-line predicate control problem
  • necessary and sufficient conditions for the off-line readers writers problem
  • O(np) algorithm for the off-line readers writers problem
  • also: other variants of off-line mutual exclusion

SLIDE 14

On-line Mutual Exclusion is Impossible

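[Figure: three computations on processes P1 and P2 with critical sections cs1 and cs2: one with a marked global state G, and two continuations with events a to d, each with a marked global state H.]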