Evaluation of Dependable Layered Systems with a Fault Management Architecture

SLIDE 1

WADS Workshop at ICSE 2002 Woodside 1

Evaluation of Dependable Layered Systems with a Fault Management Architecture

Olivia Das, C. Murray Woodside

  • Dept. of Systems and Computer Engineering, Carleton University, Ottawa, Canada

email: odas@sce.carleton.ca, cmw@sce.carleton.ca

SLIDE 2

Layered System Model

Tasks, Interactions and Dependencies, and Processors

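... Configuration depends on Failure State

[Figure: layered system model diagram. Labels as extracted: user groups UserA (NUserA = 50) and UserB (NUserB = 100); tasks AppA, AppB, Server1, Server2; processors procA, procB, proc1, proc2, proc3, proc4; other labels userA, userB, eA, eB, eA-1, eA-2, eB-1, eB-2, serviceA, serviceB, #1, #2.]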

SLIDE 3

Example Configuration (1) ... failure compensated by standby servers

Processor 3 fails and takes Server1 out of service ... Server2 is used instead

[Figure: the layered system model diagram, with the same labels as on Slide 2.]

SLIDE 4

Example Configuration (2) ... failure cannot be compensated by standby servers

Processor 2 fails and takes Application1 out of service ... Group Users1 is off the air ... the performability measure is reduced

[Figure: the layered system model diagram, with the same labels as on Slide 2.]
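The two example configurations can be reproduced with a small sketch that maps a set of failed components to the configuration that remains operational. Everything below is an assumption made for illustration: the task-to-processor placement is read off the diagram labels (AppA on proc1, AppB on proc2, Server1 on proc3, Server2 on proc4), both applications are assumed to prefer Server1 and fall back to Server2, and the function names are hypothetical. The slides do not give this as the authors' algorithm.

```python
# Hypothetical deployment read off the diagram labels (an assumption).
HOST = {
    "UserA": "procA", "UserB": "procB",
    "AppA": "proc1", "AppB": "proc2",
    "Server1": "proc3", "Server2": "proc4",
}
# Assumed priority order: Server1 is the primary, Server2 the standby.
SERVER_PRIORITY = ["Server1", "Server2"]
# Assumed pairing of user groups with applications.
APP_OF_GROUP = {"UserA": "AppA", "UserB": "AppB"}


def is_up(task, failed):
    """A task works if neither it nor its host processor has failed."""
    return task not in failed and HOST[task] not in failed


def configuration(failed):
    """Map a set of failed components to the operational configuration.

    Returns {user group: server in use}; a group that cannot be served
    at all maps to None.
    """
    # Pick the highest-priority server that is still working, if any.
    server = next((s for s in SERVER_PRIORITY if is_up(s, failed)), None)
    config = {}
    for group, app in APP_OF_GROUP.items():
        chain_ok = is_up(group, failed) and is_up(app, failed)
        config[group] = server if chain_ok else None
    return config


if __name__ == "__main__":
    print(configuration({"proc3"}))  # standby server compensates
    print(configuration({"proc2"}))  # one user group loses service
```

Under these assumptions, a failure of proc3 moves both user groups to Server2, while a failure of proc2 leaves one user group without service, which is the situation in which the performability measure drops.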

SLIDE 5

Fault Propagation Graph ...

used to find the configuration states and add up their probabilities

[Figure: fault propagation graph. Node labels as extracted: r, userA, userB, UserA, UserB, procA, procB, eA, eB, AppA, AppB, proc1, proc2, serviceA, serviceB, eA-1, eA-2, eB-1, eB-2, proc3, proc4, Server1, Server2, #1, #2.]
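The idea of finding configuration states and adding up their probabilities can be illustrated with a brute-force sketch. This is not the fault propagation graph technique itself: it simply enumerates every up/down combination of independently failing components, asks a caller-supplied function which configuration results, and sums the probabilities per configuration. The toy system (one application with a primary and a standby server), the availability values, and all function names are assumptions.

```python
from itertools import product


def configuration_probabilities(availability, configuration_of):
    """Sum failure-state probabilities per operational configuration.

    availability: {component: probability that it is up}, assumed independent.
    configuration_of: function mapping a frozenset of failed components
        to a hashable configuration label.
    """
    names = sorted(availability)
    totals = {}
    for states in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, up in zip(names, states):
            prob *= availability[name] if up else 1.0 - availability[name]
        failed = frozenset(n for n, up in zip(names, states) if not up)
        config = configuration_of(failed)
        totals[config] = totals.get(config, 0.0) + prob
    return totals


# Toy system (hypothetical numbers): one application, primary and standby server.
AVAIL = {"app": 0.99, "primary": 0.9, "standby": 0.9}


def toy_configuration(failed):
    """Return which server is in use, or 'down' if no service is possible."""
    if "app" in failed:
        return "down"
    if "primary" not in failed:
        return "primary"
    if "standby" not in failed:
        return "standby"
    return "down"


if __name__ == "__main__":
    probs = configuration_probabilities(AVAIL, toy_configuration)
    for config, p in sorted(probs.items()):
        print(f"{config}: {p:.4f}")
```

In this toy case the eight failure states collapse into three configurations. The enumeration cost still grows as 2^n in the number of components, so this sketch only shows what is being computed, not how to compute it efficiently.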

SLIDE 6

Management Subsystem

  • Reaction delays
  • Management subsystem failures and repairs

[Figure: management subsystem diagram. Labels as extracted: Manager, Application, Agent (three occurrences), Subagent, Server1, Server2.]

SLIDE 7

Specifying a Management Architecture

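[Figure: management architecture diagram. Labels as extracted, grouped by the processor they appear with:
  • proc1:Proc: AppA:AT, ag1:AGT, c1:AW, c5:Ntfy
  • proc2:Proc: AppB:AT, ag2:AGT, c2:AW, c6:Ntfy
  • proc3:Proc: Server1:AT, ag3:AGT, c3:AW
  • proc4:Proc: Server2:AT, ag4:AGT, c4:AW
  • proc5:Proc: m1:MT
  • other connectors: c7:AW, c8:SW, c9:AW, c10:SW, c11:AW, c12:SW, c13:Ntfy, c14:AW, c15:SW, c16:Ntfy]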

Elements

Components

  • Application processes
  • Management Agents
  • Managers

Connectors

  • Alive-watch
  • Status-watch
  • Notifier

SLIDE 8

Functionality

  • Application process status is monitored by its local agent (Alive-watch connection)
  • Processor status is monitored by a Manager on another node
    ... e.g. by pinging
  • System-wide status is gathered by Managers (Status connections)
    ... and distributed back to Agents (Notify connections)
  • Application process reconfiguration is triggered by the agent on its node (Notification connection)
    ... e.g. to switch to a standby server, or to restart a process
  • Capability to reconfigure is conditioned by “Knowledge” of the status of the system
    ... that is, by the Management Architecture and its failures
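One way to picture how "Knowledge" conditions reconfiguration is as a reachability question: news of a failure must travel over working management components to the task that has to react. The sketch below is a guess at that idea, not the authors' formulation. The component names come from the architecture slide, but the connector endpoints in `FLOWS` are hypothetical, since the diagram's wiring is not recoverable from the text.

```python
from collections import deque

# Hypothetical status flows (source -> destination); endpoints are assumed.
FLOWS = [
    ("Server1", "ag3"),  # alive-watch: agent ag3 watches Server1
    ("ag3", "m1"),       # status-watch: manager m1 collects status from ag3
    ("proc3", "m1"),     # alive-watch: m1 pings processor proc3
    ("m1", "ag1"),       # notify: m1 informs agent ag1
    ("ag1", "AppA"),     # notify: ag1 triggers reconfiguration of AppA
]


def knows(target, failed_component, failed):
    """True if news of failed_component can reach target.

    The news starts at failed_component and may only pass through
    components that are not themselves in the failed set.
    """
    blocked = set(failed) - {failed_component}
    seen = {failed_component}
    queue = deque([failed_component])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for src, dst in FLOWS:
            if src == node and dst not in seen and dst not in blocked:
                seen.add(dst)
                queue.append(dst)
    return False


if __name__ == "__main__":
    # proc3 fails and takes Server1 and ag3 with it; m1 still detects it.
    print(knows("AppA", "proc3", {"proc3", "Server1", "ag3"}))
    # If the manager has also failed, AppA never learns of the failure.
    print(knows("AppA", "proc3", {"proc3", "Server1", "ag3", "m1"}))
```

With these assumed flows, AppA can switch to the standby server only while m1 and ag1 are working, which is the sense in which management failures limit reconfiguration.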

SLIDE 9

Analysis.... currently....

  • Markov model for component failures and repairs
    ... (e.g., independent failure of processors and processes)
  • Derive configurations and their probabilities
    ... additional configurations that include Management Subsystem failure
  • Reconfiguration capability is limited by “Knowledge” of the status, and thus by the Management Subsystem state
    ... thus, additional delays to repair
  • Analyse the performance of each configuration
    ... assemble measures based on configuration probabilities
    ... related to work by Haverkort with queueing models and server failures
    ... here, extended with layered dependencies for failure, and layered queueing models for performance
  • Consider bounds and approximations
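Two of the steps above reduce to short calculations, sketched here under stated assumptions. For a component with a constant failure rate and a constant repair rate (a two-state Markov model), the steady-state availability is repair_rate / (failure_rate + repair_rate). Once each configuration has a probability and a performance value, the overall measure is the probability-weighted sum. The rates, probabilities, and throughputs below are invented for illustration; in the approach described, the per-configuration performance would come from solving a layered queueing model, which this sketch does not do.

```python
def steady_state_availability(failure_rate, repair_rate):
    """Long-run probability that a two-state (up/down) component is up."""
    return repair_rate / (failure_rate + repair_rate)


def expected_reward(config_probs, reward):
    """Probability-weighted performance over all configurations."""
    return sum(p * reward[config] for config, p in config_probs.items())


if __name__ == "__main__":
    # Hypothetical rates per hour: fails once per 100 h, repaired in about 1 h.
    print(steady_state_availability(failure_rate=0.01, repair_rate=0.99))

    # Hypothetical configuration probabilities and throughputs (requests/s).
    config_probs = {"primary": 0.891, "standby": 0.0891, "down": 0.0199}
    throughput = {"primary": 10.0, "standby": 8.0, "down": 0.0}
    print(expected_reward(config_probs, throughput))
```

Keeping the two functions separate mirrors the separation of failure-repair analysis from performance analysis: the first feeds the configuration probabilities, the second only needs one performance result per configuration.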

SLIDE 10

Conclusions

  • Scalable technique
    ... separation of performance-level analysis from failure repair
    ... analysis of effective configurations gives a MUCH smaller set of configurations than of failure states
  • Even so, explosion of configurations is a limitation

Publications: www.sce.carleton.ca/faculty/woodside