
An Approach to Manage Reconfiguration in Fault-Tolerant Distributed Systems

Stefano Porcarelli (1), Marco Castaldi (2), Felicita Di Giandomenico (1), Andrea Bondavalli (3), Paola Inverardi (2)

(1) Italian National Research Council, ISTI Dept, Italy: stefano.porcarelli@guest.cnuce.cnr.it, digiandomenico@iei.pi.cnr.it
(2) University of L'Aquila, Dip. Informatica, Italy: {castaldi, inverard}@di.univaq.it
(3) University of Florence, Dip. Sistemi e Informatica, Italy: a.bondavalli@dsi.uni.it

May 3rd, WADS 2003

Motivations
• Large distributed systems live for several years
• Environmental events and component faults may affect the workload and functionalities of the system
• Critical systems require high availability and reliability
System reconfiguration is used to react to faults, to manage the system's life and to provide dependability properties

System Reconfigurations
• Dynamic: the reconfiguration must be performed while the system is running, without service interruption
• Automatic: the reconfiguration may be triggered in reaction to a specified event, and issued by a human administrator or an automatic Decision Maker
• Distributed: the reconfiguration is performed on distributed systems
In particular, we address:
• Component Reconfiguration: any change of the component parameters (component re-parametrization)
• Application Reconfiguration: any modification of the architecture in terms of topology and of the number and location of components

Our Approach to (Fault) Reconfiguration
• We propose to use Lira, an infrastructure created to perform dynamic reconfiguration, enriched with a model-based Decision Maker (DM)
[Figure: the Decision Maker and the Managed System, connected through Lira]
• Lira monitors the system, detects faults and notifies the Decision Maker
• For each fault pattern, a set of reconfigurations is specified
• The DM performs the evaluation
• The DM orders the reconfiguration
• Lira reconfigures the system

Our Approach to (Fault) Reconfiguration
• The decision making capability is decomposed in a hierarchical fashion:
  – Favoring fault-tolerance by distribution of control
  – Avoiding heavy computation and coordination activity whenever faults can be managed at local level
  – Facilitating the construction and on-line solution of analytical models
  – Favoring scalability

Lira Architecture
• Lira Management Infrastructure
  – Light-weight Infrastructure for Reconfiguring Applications
  – Lira is based on:
    • Agents
    • MIB (Management Information Base)
    • Management Protocol
[Figure: Human Administrator, Manager, Agent, Component, MIB, Management Protocol]

Enriched Lira Architecture
• Lira uses a different agent for each hierarchical level:
  – Component, Host, Application, Manager agent
• Each agent is enriched with a decision maker
  – Decision making capabilities depend on the hierarchical level of the agent
[Figure: Component Agent, Host Agent, Application Agent and Manager, each with its own MIB and Decision Maker, connected by the Management Protocol]

Decision Maker
• Model-Based Decision Maker
  – The dynamic topology of the system and the number of managed faults demand statistical decision capabilities
  – Combinatorial and Petri net like models (for complex relationships among components) help to take the most appropriate decision
  – The possible reconfiguration options are pre-planned: models allow deciding each time which is the most appropriate one
• The component's state is modeled by using three states:
  – Up
  – Degraded
  – Down
[Figure: component state model with states Up, Degraded and Down]

A Case Study
• Distributed computing where peer-to-peer clients on the network are communicating
• Path redundancy is used to prevent service interruption
[Figure: hosts H1 to H6 on networks Net 1 and Net 2; clients connected through nodes N1 to N4 over links a to g]

Path   Route
1      a-N1-c-N3-f
2      a-N1-c-N3-d-N2-e-N4-g
3      b-N2-e-N4-g
4      b-N2-d-N3-f
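To make the role of path redundancy concrete, here is a minimal combinatorial sketch in Python. It is not the authors' model: it assumes that links and nodes fail independently, that the service is lost only when all four routes in the table are broken, and that per-element failure probabilities are supplied by the caller. The function name and the `p_fail` parameter are hypothetical.

```python
from itertools import product

# Routes copied from the Path/Route table: the links (a-g) and
# nodes (N1-N4) that each path traverses.
PATHS = {
    1: ["a", "N1", "c", "N3", "f"],
    2: ["a", "N1", "c", "N3", "d", "N2", "e", "N4", "g"],
    3: ["b", "N2", "e", "N4", "g"],
    4: ["b", "N2", "d", "N3", "f"],
}


def service_failure_probability(p_fail):
    """Probability that every path contains at least one failed element.

    p_fail maps each link/node name to its failure probability.
    Independence between elements is an assumption of this sketch.
    """
    components = sorted({c for route in PATHS.values() for c in route})
    total = 0.0
    # Enumerate every working/failed combination (2^11 states here).
    for states in product([True, False], repeat=len(components)):
        working = dict(zip(components, states))
        prob = 1.0
        for name, ok in working.items():
            prob *= (1.0 - p_fail[name]) if ok else p_fail[name]
        # The service survives if at least one route is fully working.
        survives = any(all(working[c] for c in route) for route in PATHS.values())
        if not survives:
            total += prob
    return total
```

Exhaustive enumeration is used because the four routes share elements, so their failures cannot simply be multiplied together as if the paths were independent.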

A Case Study (cont)
• Component agent
  – HEALTH_STATE
  – CONNECTED_NODE
  – Function to connect different nodes
  – Functions to control the node
• Host agent
  – HEALTH_STATE
  – CONNECTED_HOST
  – Functions to install and activate nodes
• Application Agent
  – AVAILABLE_PATHS
  – ACTIVE_NODES
  – ACTIVE_HOSTS
  – Functions provided by the Host agents
• Manager Agent
  – ACTIVE_HOSTS
  – Functions provided by the Application agents
[Figure: the case-study network annotated with component agents A1 to A4, host agents HA1 and HA2, application agents AA1 (Net 1) and AA2 (Net 2), and the Manager]

An Example
• Suppose that node N3 starts to work in a degraded manner
• The associated agent A3 notifies the upper-level agent AA1
• The agent AA1 checks the path availability on the controlled network
• Three different reconfiguration options are possible:
  – Continuing to work in a degraded manner
  – Temporarily bypassing node N3 and waiting for its restart
  – Activating a new node to substitute N3
[Figure: the case-study network with agents A1 to A4, HA1, HA2, AA1, AA2 and the Manager]
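The path-availability check performed by AA1 can be sketched in Python as follows. This is only an illustration: the route table is taken from the case study, while the function name `available_paths` and the set-based representation are assumptions, not part of Lira.

```python
# Routes from the case study: links a-g and nodes N1-N4 per path.
PATHS = {
    1: ["a", "N1", "c", "N3", "f"],
    2: ["a", "N1", "c", "N3", "d", "N2", "e", "N4", "g"],
    3: ["b", "N2", "e", "N4", "g"],
    4: ["b", "N2", "d", "N3", "f"],
}


def available_paths(paths, unavailable):
    """Return the ids of the paths that avoid every unavailable link or node."""
    blocked = set(unavailable)
    return [pid for pid, route in paths.items() if blocked.isdisjoint(route)]


# If N3 is bypassed while it restarts, only path 3 remains usable.
print(available_paths(PATHS, {"N3"}))  # [3]
```

With N3 out of service, paths 1, 2 and 4 are all lost, which is why the choice among the three options matters.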

An Example (cont)
• Three different reconfiguration options are possible:
  – Continuing to work in a degraded manner
  – Temporarily bypassing node N3 and waiting for its restart
  – Activating a new node to substitute N3

Link or component status    Failure probability
Up state                    10^-3
Degraded state              10^-2
Restarted and new           5 * 10^-3

Policy options                P_F
Working in degraded manner    1.73848 * 10^-8
Restart node N3               5.19695 * 10^-9
Set-up a new path             4.77510 * 10^-8

• The best reconfiguration consists in restarting N3
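The final selection step can be sketched in Python as picking the policy with the smallest P_F. The values are copied from the table above; the names `POLICY_PF` and `best_policy` are hypothetical, and the sketch assumes P_F is a quantity to be minimized, which is consistent with the slide's conclusion. It does not reproduce the model that produced the values.

```python
# P_F per pre-planned policy, copied from the slide's table.
POLICY_PF = {
    "Working in degraded manner": 1.73848e-8,
    "Restart node N3": 5.19695e-9,
    "Set-up a new path": 4.77510e-8,
}


def best_policy(pf_by_policy):
    """Return the policy whose P_F is smallest."""
    return min(pf_by_policy, key=pf_by_policy.get)


print(best_policy(POLICY_PF))  # Restart node N3
```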

Conclusions
• An architecture for dependability provision has been proposed. It is based on:
  – Lira
  – Model-based Decision Maker
• We concentrate on system reconfiguration as a consequence of faults (both software and hardware)
• Hierarchical approach

Future Work
• The Lira infrastructure has to be fault-tolerant itself
• Development of a Petri net based decision maker (combinatorial models are not able to handle complex scenarios)
  – Dependencies among components
  – Accounting for time
  – Repair of components
• Development of a prototype
  – Experimental measurements
