University of Paderborn
Software Engineering Group
- Prof. Dr. Wilhelm Schäfer
Daniela Schilling – May 2005
Computing Optimal Self-Repair Actions: Damage Minimization versus Repair Time
Matthias Tichy, Holger Giese, Daniela Schilling, Wladimir Pauls
Motivation
www.railcab.de
Motivation
Redundant implementations of important software components
Required: reconfiguration
Given: automatic detection of failed components
Self-Repair Actions: automatic calculation of redeployment for failed components
[Figure: components pc1:Position Calculation, pc2:Position Calculation, pc3:Position Calculation, vot:Voter, gps:GPS-Controller, cc:Convoy, and mul:Multiplier on nodes Avalon, Taliesin, Uther, Gareth, Gorlois, Arthur]
Initial Deployment
UML Deployment Diagrams to inequalities over boolean and integer variables
deployment
[Figure: pc1:Position Calculation and pc2:Position Calculation on Node1 and Node2, with pc1.mem=2.0Mb]
WOSS/FSE 2004: Matthias Tichy, Daniela Schilling, Holger Giese: Design of Self-Managing Dependable Systems with UML and Fault Tolerance Patterns
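The mapping from a deployment to inequalities over boolean variables can be sketched as follows. This is one common formulation, not necessarily the authors' exact encoding. The symbols $x_{c,n}$, $\mathit{mem}_c$, and $\mathit{cap}_n$ are assumptions introduced for illustration; only the attribute pc1.mem=2.0Mb appears on the slide.

```latex
\begin{align*}
  x_{c,n} &\in \{0,1\}
    && \text{($x_{c,n}=1$ iff component $c$ is deployed on node $n$)} \\
  \sum_{n} x_{c,n} &= 1
    && \text{for every component } c \\
  \sum_{c} \mathit{mem}_c \, x_{c,n} &\le \mathit{cap}_n
    && \text{for every node } n
\end{align*}
```

The second line forces each component onto exactly one node, and the third bounds the memory demand placed on each node by that node's capacity.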
Online Redeployment
node fail too
calculate redeployment
perform redeployment
[Figure: damage over time]
Costs
Failed components
Components to be migrated
Online Redeployment
system
[Figure: damage over time]
Online Redeployment
caused by migration of running components) to the constraint system
[Figure: damage over time]
Online Redeployment
constraint system
components only
components that have to be redeployed/migrated
Online Redeployment
[Figure: damage over time]
Choosing Components for Redeployment
component
redundant copies of failed components
already failed
Experiment
Scenario:
36 nodes with 114 links
72 components with 99 connectors
5 node-specific (CPU, OS, Memory, Utilization, HDD) and 2 link-specific (Bandwidth, Loss) deployment restrictions
set of deployment constraints on components and connectors
Experiment:
Randomly selected a node and let it fail
Experimental Results
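[Table: time and damage per test for three compared approaches; the only approach label preserved is "Our Algorithm"]
Test Nr.     1       2       3       4
Time (ms)    13630   14890   13790   13660
Damage       773     97      4       34
Time (ms)    > 1h    56060   14920   16430
Damage       N/A     29      1       31
Time (ms)    50      30      10      50
Damage       7       30      5       34
[Figure: damage over time]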
Conclusion & Future Work
Algorithm to calculate optimal self-repair actions
Deployment constraints solved by standard constraint solver
Experiment showed that algorithm is nearly
consumption
Not presented: pre-solving step
Communication and monitoring framework
Describe repair rules by graph transformation systems
[Figure: components pc1:Position Calculation, pc2:Position Calculation, pc3:Position Calculation, vot:Voter, gps:GPS-Controller, cc:Convoy, and mul:Multiplier on nodes Avalon, Taliesin, Uther, Gareth, Gorlois, Arthur]
Example
Nodes:
Avalon Mem=2.5Mb
Taliesin Mem=1.5Mb
Uther Mem=1Mb
Gareth Mem=2Mb
Gorlois Mem=2Mb
Arthur Mem=1.5Mb
Components:
pc1:Position Calculation Mem=2Mb
pc2:Position Calculation Mem=1.5Mb
pc3:Position Calculation
gps:GPS-Controller Mem=0.5Mb
vot:Voter Mem=0.5Mb
cc:Convoy Mem=0.7Mb
mul:Multiplier Mem=0.25Mb
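The memory figures on this slide can be turned into a small feasibility check. The sketch below is a minimal illustration, not the authors' constraint solver: it brute-forces an assignment of components to nodes so that the memory demand on each node stays within its capacity. The memory value of pc3 is not given on the slide, so pc3 is left out. The function names `fits` and `find_deployment` and the failure scenario (node Gareth fails) are assumptions made for illustration.

```python
from itertools import product

# Node capacities and component demands in Mb, as read from the slide.
NODE_MEM = {"Avalon": 2.5, "Taliesin": 1.5, "Uther": 1.0,
            "Gareth": 2.0, "Gorlois": 2.0, "Arthur": 1.5}
COMP_MEM = {"pc1": 2.0, "pc2": 1.5, "gps": 0.5,
            "vot": 0.5, "cc": 0.7, "mul": 0.25}  # pc3 omitted: value not given


def fits(deployment, comp_mem, node_mem):
    """Check the per-node memory inequality for a component->node mapping."""
    used = dict.fromkeys(node_mem, 0.0)
    for comp, node in deployment.items():
        used[node] += comp_mem[comp]
    # Small tolerance guards against floating-point rounding.
    return all(used[n] <= node_mem[n] + 1e-9 for n in node_mem)


def find_deployment(comp_mem, node_mem):
    """Return the first mapping that respects all capacities, or None."""
    comps, nodes = list(comp_mem), list(node_mem)
    for choice in product(nodes, repeat=len(comps)):
        deployment = dict(zip(comps, choice))
        if fits(deployment, comp_mem, node_mem):
            return deployment
    return None


initial = find_deployment(COMP_MEM, NODE_MEM)

# Hypothetical failure: remove one node and search again on the survivors.
survivors = {n: m for n, m in NODE_MEM.items() if n != "Gareth"}
repaired = find_deployment(COMP_MEM, survivors)
```

Exhaustive search is only workable at this toy size (6 components on 6 nodes); the slides state that the actual deployment constraints are solved by a standard constraint solver.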
[Figure: components C1, C2, C3, C4, C5 and nodes n1, n2, n3, n4, n5; annotations damage=13 and damage=13]
damage: all=13, 2of3=4, 1of3=1
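One way to read the damage values on this slide is as a step function over the number of failed copies in a group of three redundant components. The sketch below encodes that reading. The interpretation of `1of3`, `2of3`, and `all` as counts of failed copies is an assumption, as are the zero damage for no failures and the names `DAMAGE_BY_FAILED` and `group_damage`.

```python
# Damage per number of failed copies in a group of three.
# 1, 4 and 13 come from the slide; 0 for "no failure" is an assumption.
DAMAGE_BY_FAILED = {0: 0, 1: 1, 2: 4, 3: 13}


def group_damage(replica_ok):
    """Return the damage for a group, given one up/down flag per copy."""
    failed = sum(1 for ok in replica_ok if not ok)
    return DAMAGE_BY_FAILED[failed]


one_down = group_damage([True, False, True])
all_down = group_damage([False, False, False])
```

Under this reading, losing the last working copy is far more costly than losing the first, which is why redundant copies of failed components matter when choosing what to redeploy.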
Initial situation: a b c d e f g
Legend: Failed components, Running components
Submodel: Consider / Consider later
1) a b c d e f g: Submodel not solvable
2) a b c d e f g: Redundant copies
3) a b c d e f g: Not related, Submodel not solvable
4) a b c e d f g
Legend: Failed components, Running components
4) a b c e d f g: Submodel not solvable
5) a b c e d f g: Redundant copies
6) a b c e d f g
7) a b c e f g d: Submodel solvable
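The walkthrough above suggests a loop that enlarges the set of considered components until the resulting submodel becomes solvable. The sketch below is a loose illustration of that idea, not the authors' algorithm. The function `grow_submodel`, the `solvable` callback, the order in which running components are added, and the choice of a, b, c as the failed components are all assumptions.

```python
def grow_submodel(failed, candidates, solvable):
    """Grow the set of considered components until the submodel is solvable.

    failed     -- components that must be redeployed
    candidates -- running components, in the order they may be added
    solvable   -- callback deciding whether a submodel can be solved
    Returns the considered components, or None if no submodel is solvable.
    """
    considered = list(failed)
    later = [c for c in candidates if c not in considered]
    while not solvable(frozenset(considered)):
        if not later:
            return None  # every component is considered and still no solution
        considered.append(later.pop(0))  # move one component into the submodel
    return considered


# Toy stand-in for the constraint solver (hypothetical): the submodel counts
# as solvable once e, f and g may be migrated along with the failed components.
def toy_solvable(submodel):
    return {"e", "f", "g"} <= submodel


result = grow_submodel(["a", "b", "c"], ["e", "f", "g", "d"], toy_solvable)
```

Keeping the submodel small means fewer running components are migrated, at the price of possibly calling the solver several times.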
Foundations (TMR)
Use fault tolerance techniques to ensure dependability
Triple Modular Redundancy (TMR)
[Figure: :Provider, :Multiplier, :Component1, :Component2, :Component3, :Voter, :User]
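The role of the voter in TMR can be illustrated with a plain majority vote over the results of the three redundant components. This is a generic sketch of majority voting, not code from the presented system; the function name `majority_vote` and the sample values are assumptions.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by more than half of the replicas."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > len(results):
        return value
    raise ValueError("no majority among replica results")


# One of three replicas delivers a deviating result and is outvoted.
voted = majority_vote([42, 42, 41])
```

With three replicas, a single faulty result is masked; two disagreeing faulty results leave no majority, which the sketch reports as an error.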
Foundations (TMR)
Deployment constraints for TMR
Avoid crash failures
components to distinct nodes
Avoid single-point-
multiplier
user to same node
(if the user fails, the failure of the voter is no problem)
Heterogeneous hardware platform
{ Node3.CPU Node4.CPU Node4.CPU Node5.CPU Node3.CPU Node5.CPU }
[Figure: :Provider, :Multiplier, :Component1, :Component2, :Component3, :Voter, :User on Node1, Node2, Node3, Node4, Node5]
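Two of the constraints listed above, redundant components on distinct nodes and the voter on the same node as the user, can be checked mechanically for a given deployment. The sketch below is a minimal illustration under assumptions: the function `check_tmr_deployment`, the dictionary layout, and the sample node assignments are hypothetical, and the constraints whose wording is incomplete on the slide are not modeled.

```python
REPLICAS = ["Component1", "Component2", "Component3"]


def check_tmr_deployment(node_of):
    """Return the list of violated constraints for a component->node mapping."""
    violations = []
    # Crash failures: no two redundant components may share a node.
    if len({node_of[r] for r in REPLICAS}) != len(REPLICAS):
        violations.append("replicas must be on distinct nodes")
    # If the user's node fails, losing the voter with it does no extra harm.
    if node_of["Voter"] != node_of["User"]:
        violations.append("voter and user must share a node")
    return violations


# Hypothetical deployments for illustration.
good = {"Component1": "Node3", "Component2": "Node4", "Component3": "Node5",
        "Voter": "Node2", "User": "Node2"}
bad = dict(good, Component2="Node3", Voter="Node1")

good_report = check_tmr_deployment(good)
bad_report = check_tmr_deployment(bad)
```

A check like this only validates a candidate deployment; finding one that also satisfies the resource restrictions is the job of the constraint solver.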
Online Redeployment
components
(reduce damage)