SLIDE 1

May 3rd, WADS 2003

An Approach to Manage Reconfiguration in Fault-Tolerant Distributed Systems

Stefano Porcarelli (1), Marco Castaldi (2), Felicita Di Giandomenico (1), Andrea Bondavalli (3), Paola Inverardi (2)

(1) Italian National Research Council, ISTI Dept, Italy

stefano.porcarelli@guest.cnuce.cnr.it, digiandomenico@iei.pi.cnr.it

(2) University of L'Aquila, Dip. Informatica, Italy

{castaldi, inverard}@di.univaq.it

(3) University of Florence, Dip. Sistemi e Informatica, Italy

a.bondavalli@dsi.uni.it

SLIDE 2

Motivations

  • Large distributed systems live for several years
  • Environmental events and component faults may affect the workload and functionality of the system
  • High availability and reliability of critical systems

System reconfiguration to react to faults, to manage the system's life, and to provide dependability properties

SLIDE 3

System Reconfigurations

  • Dynamic: the reconfiguration must be performed while the system is running, without service interruption
  • Automatic: the reconfiguration may be triggered in reaction to a specified event, issued by a human administrator or an automatic Decision Maker
  • Distributed: the reconfiguration is performed on distributed systems

In particular, we address:

  • Component Reconfiguration: any change to the component parameters (component re-parametrization)
  • Application Reconfiguration: any modification of the architecture in terms of topology, number of components, and component location

SLIDE 4

Our Approach to (Fault) Reconfiguration

[Figure: control loop between the Managed System and the Decision Maker (DM)]

  • Lira monitors the system, detects faults, and notifies the Decision Maker
  • For each fault pattern, a set of reconfigurations is specified
  • The DM performs the evaluation
  • The DM orders the reconfiguration
  • Lira reconfigures the system

  • We propose to use Lira, an infrastructure created to perform dynamic reconfiguration, enriched with a model-based Decision Maker
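
The monitor, decide, and reconfigure steps on this slide can be sketched in a few lines of Python. This is only an illustration of the control flow, not Lira's implementation: the class names `DecisionMaker` and `Infrastructure`, the fault-pattern label, the option names, and the "lower score is better" convention are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DecisionMaker:
    """Holds the reconfigurations specified in advance for each fault pattern."""
    # fault pattern -> candidate reconfigurations (hypothetical labels)
    options: Dict[str, List[str]]
    # scoring function; lower score means better option (assumed convention)
    evaluate: Callable[[str], float]

    def decide(self, fault_pattern: str) -> str:
        # Evaluate every pre-specified option and pick the best-scoring one.
        candidates = self.options[fault_pattern]
        return min(candidates, key=self.evaluate)


@dataclass
class Infrastructure:
    """Stand-in for the monitoring and reconfiguration infrastructure."""
    decision_maker: DecisionMaker
    applied: List[str] = field(default_factory=list)

    def on_fault(self, fault_pattern: str) -> str:
        # Fault detected -> notify the DM -> the DM orders -> reconfigure.
        chosen = self.decision_maker.decide(fault_pattern)
        self.applied.append(chosen)  # placeholder for the real reconfiguration
        return chosen


# Hypothetical scores for three hypothetical options.
scores = {"keep_running": 3.0, "restart": 1.0, "replace": 2.0}
dm = DecisionMaker({"node_degraded": list(scores)}, scores.__getitem__)
infra = Infrastructure(dm)
print(infra.on_fault("node_degraded"))  # restart
```

Keeping the evaluation behind a single `evaluate` callable means the scoring model can be swapped without touching the loop itself.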

SLIDE 5

Our Approach to (Fault) Reconfiguration

  • The decision-making capability is decomposed in a hierarchical fashion:

– Favoring fault tolerance through distribution of control
– Avoiding heavy computation and coordination activity whenever faults can be managed at the local level
– Facilitating the construction and on-line solution of analytical models
– Favoring scalability
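
One way to picture the hierarchical decomposition is a chain of agents in which a fault is escalated only when the current level cannot manage it. The sketch below is a minimal illustration under assumptions: the `Agent` class, its `notify` method, and the fault kinds assigned to each level are hypothetical; only the four level names come from the deck.

```python
from typing import Optional, Set


class Agent:
    """One level of the hierarchy; escalates faults it cannot manage itself."""

    def __init__(self, name: str, handles: Set[str],
                 parent: Optional["Agent"] = None) -> None:
        self.name = name
        self.handles = handles  # fault kinds manageable here (hypothetical)
        self.parent = parent

    def notify(self, fault: str) -> str:
        """Return the name of the level that ends up managing the fault."""
        if fault in self.handles:
            return self.name  # managed locally, no coordination needed
        if self.parent is None:
            raise LookupError(f"no level can manage fault {fault!r}")
        return self.parent.notify(fault)  # escalate one level up


# Hypothetical assignment of fault kinds to the four levels.
manager = Agent("manager", {"network_partition"})
application = Agent("application", {"path_lost"}, parent=manager)
host = Agent("host", {"node_crash"}, parent=application)
component = Agent("component", {"parameter_drift"}, parent=host)

print(component.notify("parameter_drift"))  # component
print(component.notify("path_lost"))        # application
```

Because each level answers for itself before calling its parent, the upper levels are involved only in the cases the lower ones cannot resolve.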

SLIDE 6

Lira Architecture

  • Lira Management Infrastructure

– Light-weight Infrastructure for Reconfiguring Applications
– Lira is based on:

  • Agents
  • MIB (Management Information Base)
  • Management Protocol

[Figure: Lira architecture, with labels MIB, Comp, Component Agent, Human Administrator, Manager, Management Protocol]

SLIDE 7

Enriched Lira Architecture

[Figure: enriched Lira architecture, with labels Component, Comp Agent, Host, Host Agent, Application Agent, Manager, Decision Maker, MIB, Management Protocol]

  • Lira uses a different agent for each hierarchical level:

– Component, Host, Application, and Manager agents

  • Each agent is enriched with a decision maker

– Decision-making capabilities depend on the hierarchical level of the agent

SLIDE 8

Decision Maker

[Figure: state diagram with the states Up, Degraded, and Down]

  • Model-Based Decision Maker

– The dynamic topology of the system and the number of managed faults call for statistical decision capabilities
– Combinatorial and Petri-net-like models (for complex relationships among components) help to take the most appropriate decision
– The possible reconfiguration options are pre-planned: the models allow deciding each time which one is the most appropriate

The component's state is modeled by using three states:

  • Up
  • Degraded
  • Down
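
As a rough illustration of how a three-state component model can feed a combinatorial calculation, the sketch below maps each state to a failure probability and combines the elements of a chain. The deck does not give its formulas, so this is an assumption-laden example rather than the authors' model: the independence assumption, the series formula, and the value 1.0 for the Down state are all choices made here. The Up and Degraded values are the ones listed in the deck's example table.

```python
from enum import Enum
from math import prod
from typing import Iterable


class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


# UP and DEGRADED use the values from the deck's example table;
# DOWN = 1.0 is an assumption made for this sketch.
FAILURE_PROBABILITY = {
    State.UP: 1e-3,
    State.DEGRADED: 1e-2,
    State.DOWN: 1.0,
}


def series_failure_probability(states: Iterable[State]) -> float:
    """Failure probability of a chain that needs every element to work,
    assuming the elements fail independently of each other."""
    return 1.0 - prod(1.0 - FAILURE_PROBABILITY[s] for s in states)


healthy = series_failure_probability([State.UP, State.UP])
degraded = series_failure_probability([State.UP, State.DEGRADED])
print(healthy < degraded)  # True
```

A model of this kind cannot express dependencies among components or time, which is the limitation that motivates Petri-net-like models.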

SLIDE 9

A Case Study

  • Distributed computing where peer-to-peer clients on the network are communicating
  • Path redundancy is used to prevent service interruption

[Figure: case-study topology, with labels H1 to H6, Net 1, Net 2, client, nodes N1 to N4, and links a to g]

Route  Path
1      a-N1-c-N3-f
2      a-N1-c-N3-d-N2-e-N4-g
3      b-N2-e-N4-g
4      b-N2-d-N3-f
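
The four routes in the table make the benefit of path redundancy easy to check mechanically. The sketch below, a minimal illustration rather than part of the presented system, treats each route as a sequence of links and nodes and filters out the routes that pass through an unavailable element. The function name `available_routes` is hypothetical.

```python
# Routes as listed in the route table: links a to g, nodes N1 to N4.
ROUTES = {
    1: "a-N1-c-N3-f",
    2: "a-N1-c-N3-d-N2-e-N4-g",
    3: "b-N2-e-N4-g",
    4: "b-N2-d-N3-f",
}


def available_routes(unavailable):
    """Return the route numbers whose path avoids every unavailable element."""
    blocked = set(unavailable)
    return [
        number
        for number, path in sorted(ROUTES.items())
        if blocked.isdisjoint(path.split("-"))
    ]


print(available_routes({"N3"}))  # [3]
print(available_routes({"d"}))   # [1, 3]
```

With N3 out of service only route 3 remains, which is why the state of N3 matters so much in the example that follows.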

SLIDE 10

A Case Study (cont.)

[Figure: agent hierarchy for the case study, with labels Manager, AA1, AA2, HA1, HA2, H1, H2, H5, H6, client, Net 1, Net 2, nodes N1 to N4, and agents A1 to A4]

  • Component agent

– HEALTH_STATE
– CONNECTED_NODE
– Function to connect different nodes
– Functions to control the node

  • Host agent

– HEALTH_STATE
– CONNECTED_HOST
– Functions to install and activate nodes

  • Application Agent

– AVAILABLE_PATHS
– ACTIVE_NODES
– ACTIVE_HOSTS
– Functions provided by the Host agents

  • Manager Agent

– ACTIVE_HOSTS
– Functions provided by the Application agents
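
The variables and functions listed for each agent can be pictured as a small table that the agent exposes to the level above it. The sketch below is a hypothetical illustration only: the `MIBAgent` class and its `get`, `set`, and `call` methods are assumptions and are not taken from Lira's actual management protocol; only the variable names `HEALTH_STATE` and `CONNECTED_NODE` come from the slide.

```python
from typing import Any, Callable, Dict


class MIBAgent:
    """Hypothetical agent exposing MIB variables and functions."""

    def __init__(self, variables: Dict[str, Any],
                 functions: Dict[str, Callable[..., Any]]) -> None:
        self._variables = dict(variables)
        self._functions = dict(functions)

    def get(self, name: str) -> Any:
        """Read an exported variable."""
        return self._variables[name]

    def set(self, name: str, value: Any) -> None:
        """Write an exported variable; unknown names are rejected."""
        if name not in self._variables:
            raise KeyError(f"unknown MIB variable: {name}")
        self._variables[name] = value

    def call(self, name: str, *args: Any) -> Any:
        """Invoke an exported function with the given arguments."""
        return self._functions[name](*args)


# A component agent with the two variables named on the slide and one
# illustrative function that records requested connections.
connections = []
node_agent = MIBAgent(
    variables={"HEALTH_STATE": "up", "CONNECTED_NODE": None},
    functions={"connect": lambda a, b: connections.append((a, b))},
)
node_agent.set("HEALTH_STATE", "degraded")
node_agent.call("connect", "N1", "N3")
print(node_agent.get("HEALTH_STATE"))  # degraded
```

A higher-level agent could be built the same way, with its functions delegating to the `call` method of the agents below it.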

SLIDE 11

An Example

  • Suppose that node N3 starts to work in a degraded manner
  • The associated agent A3 notifies the upper level, AA1
  • The agent AA1 checks the path availability on the controlled network
  • Three different reconfiguration options are possible:

– Continuing to work in a degraded manner
– Temporarily bypassing node N3 and waiting for its restart
– Activating a new node to substitute for N3

[Figure: agent hierarchy for the case study, with labels Manager, AA1, AA2, HA1, HA2, H1, H2, H5, H6, client, Net 1, Net 2, nodes N1 to N4, and agents A1 to A4]

SLIDE 12

An Example

  • Three different reconfiguration options are possible:

– Continuing to work in a degraded manner
– Temporarily bypassing node N3 and waiting for its restart
– Activating a new node to substitute for N3

  • The best reconfiguration consists in restarting N3

Policy Options                PF
Working in degraded manner    1.73848 * 10^-8
Restart node N3               5.19695 * 10^-9
Set-up a new path             4.77510 * 10^-8

Link or component status    Failure Probability
Up state                    10^-3
Degraded state              10^-2
Restarted and new           5 * 10^-3
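
Given the PF values in the policy table, the final selection step reduces to picking the option with the smallest value. The sketch below shows only that step and assumes that a lower PF is better, which is consistent with the slide's conclusion that restarting N3 is the best option. It does not reproduce how the PF values themselves were computed, and the function name `best_policy` is hypothetical.

```python
# PF values copied from the policy table.
POLICY_PF = {
    "Working in degraded manner": 1.73848e-8,
    "Restart node N3": 5.19695e-9,
    "Set-up a new path": 4.77510e-8,
}


def best_policy(pf_by_policy):
    """Return the policy with the smallest PF value."""
    return min(pf_by_policy, key=pf_by_policy.get)


print(best_policy(POLICY_PF))  # Restart node N3
```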

SLIDE 13

Conclusions

  • An architecture for dependability provision has been proposed. It is based on:

– Lira
– Model-based Decision Maker

  • We concentrate on system reconfiguration as a consequence of faults (both software and hardware)
  • Hierarchical approach

SLIDE 14

Future Work

  • The Lira infrastructure has to be fault-tolerant itself
  • Development of a Petri-net-based decision maker (combinatorial models are not able to handle complex scenarios)

– Dependencies among components
– Accounting for time
– Repair of components

  • Development of a prototype

– Experimental measurements