
SLIDE 1

Designing fault-diagnosis and reintegration to prevent node redundancy attrition in highly reliable control systems based on FTT-Ethernet

Sinisa Derasevic, Manuel Barranco, Julián Proenza

Mathematics and Computer Science Department, University of the Balearic Islands (UIB), Spain


SLIDE 2

diagnosis and reintegration of faulty nodes in highly reliable Distributed Control Systems based on FTT-Ethernet

[diagram: node 1, node 2, node 3, ..., node M; switch]

SLIDE 3

diagnosis and reintegration of faulty nodes in highly reliable Distributed Control Systems based on FTT-Ethernet: a relevant piece of FT4FTT

[diagram: node 1, node 2, node 3, ..., node M; switch]

SLIDE 4

high reliability by tolerating faults at
switch → duplicate
links → duplicate
nodes

[diagram: node 1, node 2, node 3, ..., node M; leader switch, follower switch]

SLIDE 5

high reliability by tolerating faults at
switch → duplicate
links → duplicate
nodes → actively replicate critical nodes & vote

[diagram: node 1, node 2, node 3, ..., node M; leader switch, follower switch]

SLIDE 6


which are the critical nodes?

SLIDE 7

which are the critical nodes?

[diagram: plant; nodes C (controller), S (sensor), A (actuation), ..., node M]

SLIDE 8

which are the critical nodes?

in principle, all of these nodes can be considered critical

[diagram: plant; nodes C (controller), S (sensor), A (actuation), ..., node M; label: system failure]

SLIDE 9

which are the critical nodes?

replicating sensor and actuation nodes is trivial

[diagram: plant; controller node C; replicated S (sensor) and A (actuation) nodes, ..., node M]

SLIDE 10

which are the critical nodes?

replicating a controller node is complex: the replicas must coordinate among themselves

[diagram: plant; sensor(s) S, actuator(s) A, ..., node M; controller replica 1, replica 2, ..., replica N]

SLIDE 11

how do replicas coordinate?

synchronize at the communication and application levels using the Trigger Message (TM)
vote on intermediate results

SLIDE 12

how do replicas coordinate?

synchronize at the communication and application levels using the Trigger Message (TM)
vote on intermediate results

SLIDE 13

voting

app: control cycle (sense, control, actuate)

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3]

SLIDE 14

voting

app: control cycle (sense, control, actuate)

acquire sensors: A A B

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3]

SLIDE 15

voting

acquire sensors: A A B
exchange sensors: A A B

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3]

SLIDE 16

voting

acquire sensors: A A B
exchange sensors: replica 1, replica 2, and replica 3 each hold A A B
vote on sensors: each replica votes

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3]

SLIDE 17

voting

acquire sensors: A A B
exchange sensors: replica 1, replica 2, and replica 3 each hold A A B
vote on sensors: consensus A A A

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3]
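
The acquire, exchange, and vote steps shown on these slides can be sketched in Python. This is only an illustration under stated assumptions, not the FT4FTT implementation: the function name `majority_vote`, the in-memory "exchange", and the assignment of the values A, A, B to replicas 1, 2, 3 are all hypothetical.

```python
from collections import Counter

def majority_vote(values):
    """Return the value held by a strict majority of replicas, or None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

# Acquire: each replica reads its own sensor value.
# Which replica reads B is an assumption made for this example.
acquired = {"replica 1": "A", "replica 2": "A", "replica 3": "B"}

# Exchange: afterwards every replica holds the same vector of readings.
exchanged = list(acquired.values())

# Vote: every replica votes on the same vector, so all reach the same result.
consensus = {name: majority_vote(exchanged) for name in acquired}
print(consensus)
```

Because every replica votes on an identical vector, the replica that read B ends up with the same voted value as the other two.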

SLIDE 18

voting

app: control cycle (sense, control, actuate)

consensus: A A A

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3]

SLIDE 19

benefits of active node replication with voting?

SLIDE 20

compensate errors

the system can correctly deliver its service

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3; marks: ✔, e]

SLIDE 21

replicas may recover from errors

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3; marks: ✔, e (temporary, replica 3)]

if replica 3 can vote, it recovers and keeps participating
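
A minimal sketch of this recovery idea, assuming that a replica overwrites its local value with the voted one at the end of each vote. The `Replica` class, its `state` field, and the numeric values are hypothetical and only illustrate the principle.

```python
from collections import Counter

def vote(values):
    """Strict-majority vote; returns None when no majority exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

class Replica:
    def __init__(self, name, state):
        self.name = name
        self.state = state

    def adopt(self, voted):
        # Assumption: a replica that takes part in the vote replaces its
        # local value with the voted one, which masks a temporary error.
        if voted is not None:
            self.state = voted

# Replica 3 holds an erroneous value after a temporary fault.
replicas = [Replica("replica 1", 10), Replica("replica 2", 10), Replica("replica 3", 99)]

voted = vote([r.state for r in replicas])
for r in replicas:
    r.adopt(voted)

print([(r.name, r.state) for r in replicas])
```

The sketch only covers the case where replica 3 is still able to vote; a replica that has lost synchronization cannot take this path.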

SLIDE 22


however…

SLIDE 23

what if a temporary fault causes a replica to be lost from then on?

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3]

SLIDE 24

what if a temporary fault causes a replica to be lost from then on?

a temporary fault affects replica 3's internals or communication capabilities

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3]

SLIDE 25

what if a temporary fault causes a replica to be lost from then on?

a temporary fault affects replica 3's internals or communication capabilities
replica 3 may desynchronize at the application and/or communication level

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3; marks: ? ?]

SLIDE 26

what if a temporary fault causes a replica to be lost from then on?

a temporary fault affects replica 3's internals or communication capabilities
replica 3 may desynchronize at the application and/or communication level

replica 3: I cannot recover!

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3; marks: ? ?]

SLIDE 27

node redundancy attrition

replica 3: I cannot recover!
replica 3 is not permanently faulty, but it cannot be used!

[diagram: leader switch, follower switch; replica 1, replica 2, replica 3; marks: ? ?, ×]

SLIDE 28


temporary faults are more probable than permanent ones

SLIDE 29


if we do not prevent redundancy attrition caused by temporary faults

SLIDE 30

then we do not take full advantage of the redundancy investment

SLIDE 31

objective

prevent node redundancy attrition

SLIDE 32

objective

identify and implement mechanisms to diagnose and reintegrate nodes that are lost because of temporary faults

SLIDE 33

steps

classify faults
exhaustively analyze how they can affect a replica
design needed mechanisms
implement and test them

SLIDE 34

steps

classify faults
exhaustively analyze how they can affect a replica
design needed mechanisms
implement and test them → pending

SLIDE 35


we plan to quantify the reliability improvement

SLIDE 36

thank you for your attention!

System Architecture

DECS
FT4FTT
Node Replication
Link Replication
Switch Duplication
Control Application

Designing fault-diagnosis and reintegration to prevent node redundancy attrition in highly reliable control systems based on FTT-Ethernet

Sinisa Derasevic, Manuel Barranco, Julián Proenza
DMI, Universitat de les Illes Balears, Spain
sinishadj@gmail.com, manuel.barranco@uib.es, julian.proenza@uib.es

Abstract

Distributed Embedded Control Systems (DECSs) used for Real-Time (RT) critical applications must satisfy stringent time requirements and attain high reliability. FTT-Ethernet provides nodes of DECSs with real-time communication capabilities, but does not include Fault Tolerance (FT) mechanisms. The FT4FTT project aims at proposing a complete FT architecture for RT critical DECSs. It uses a duplicated switched FTT-Ethernet star and active node replication with consistent distributed majority voting to respectively tolerate channel and node faults. However, FT4FTT, in its current state, still lacks mechanisms to prevent node redundancy attrition due to temporary faults affecting the nodes and channel, which are the most likely types of faults in DESs. This paper presents our ongoing work to complete the FT4FTT architecture with appropriate fault-diagnosis and reintegration mechanisms that overcome this limitation.

[diagram: node 1, node 2, node 3, ..., node M; leader switch, follower switch; controller replica 1, replica 2, ..., replica N; plant, actuator(s), sensor(s)]

Extended Control Application Cycle to support Fault Tolerance, Diagnosis and Reintegration

Distributed Consistent Majority Voting (DCMV)
Segments (NVP paradigm)
Error compensation
Replica determinism

Control application phases in FT4FTT
Sense (S)
Exchange Sensor Values (ESV)
Vote on Sensor Values (VS)
Control (C)
Exchange Actuation Values (EAV)
Vote on Actuation Values (VA)
Actuate (A)

Exchange also Set Point (SP) & Status of control → seamless reintegration

VCR: to reliably vote in a consistent manner
CVEP: retransmissions of cc-vectors and ACKs
MS-vector: to diagnose communication faults

[timeline from TM i to TM i+1: S, ESV, VS, C, EAV, VA, A; segment 1, segment 2; attempt 1, attempt 2, …, attempt K; VCR]
exchange: Sensor, Set Point & Status
exchange: Actuation, Set Point & Status
vote on the exchanged values
calculate: Actuation, Status
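
The phase sequence S, ESV, VS, C, EAV, VA, A can be illustrated with a small single-process simulation. It is a sketch under assumptions, not the FT4FTT code: the exchange is modelled as copying a list, the control law (a proportional term with a hypothetical `gain`) is invented for the example, and Set Point and Status exchange, segments, and retransmission attempts are left out.

```python
from collections import Counter

PHASES = ["S", "ESV", "VS", "C", "EAV", "VA", "A"]

def majority(values):
    """Return the strict-majority value, or raise if there is none."""
    value, count = Counter(values).most_common(1)[0]
    if count <= len(values) / 2:
        raise ValueError("no majority among replicas")
    return value

def control_cycle(readings, set_point, gain=2):
    """Simulate one cycle; readings[i] is the sensor value seen by replica i."""
    n = len(readings)
    trace = ["S"]                      # S: each replica senses
    trace.append("ESV")                # ESV: exchange sensor values
    sensor_vectors = [list(readings) for _ in range(n)]
    trace.append("VS")                 # VS: vote on sensor values
    voted_sensor = [majority(v) for v in sensor_vectors]
    trace.append("C")                  # C: hypothetical proportional control law
    actuation = [gain * (set_point - s) for s in voted_sensor]
    trace.append("EAV")                # EAV: exchange actuation values
    actuation_vectors = [list(actuation) for _ in range(n)]
    trace.append("VA")                 # VA: vote on actuation values
    voted_actuation = [majority(v) for v in actuation_vectors]
    trace.append("A")                  # A: actuate with the voted value
    return trace, voted_actuation

trace, outputs = control_cycle([20, 20, 25], set_point=22)
print(trace)
print(outputs)
```

Voting twice per cycle, once on sensor values and once on actuation values, means a discrepancy introduced either before or during the control calculation is masked before the actuators are driven.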

Analysis of Fault Tolerance, Diagnosis & Reintegration Mechanisms

Fault Classification
Temporary (T)
Long-Lasting temporary (LL)
Permanent (P)
Temp. manifesting as Perm. (T…P)
Fault affecting Link (FL)
Fault affecting Node rep. (FN)

Fault Diagnosis & Reint. mechanisms
TM resynchronization
TM Seq. Num. (TMSQ)
TM Seq. Num. Count. (TMSQC)
Voting Reintegration Point
Communication Error Counter (CEC)
Discrepancy Error Counter (DEC)
You Are Alive (YAA) watchdog

TFL
  rx TM: TM replication
  rx/tx cc-vec./ACK/SP: CVEP
  sensor acquisition: x
  actuator/control calculation: x
  majority voting: x

LLFL
  rx TM: node rep. & maj. vot.; TM resync; Voting Reint. Point
  rx/tx cc-vec./ACK/SP: node rep. & maj. vot.; Voting Reint. Point
  sensor acquisition: x
  actuator/control calculation: x
  majority voting: x

PFL
  rx TM: link replication
  rx/tx cc-vec./ACK/SP: link replication
  sensor acquisition: x
  actuator/control calculation: x
  majority voting: x

TFN
  rx TM: TM replication; node rep. & maj. vot.; TM resync; Voting Reint. Point
  rx/tx cc-vec./ACK/SP: CVEP; node rep. & maj. vot.; Voting Reint. Point
  sensor acquisition: node rep. & maj. vot.; Voting Reint. Point
  actuator/control calculation: node rep. & maj. vot.; Voting Reint. Point
  majority voting: node rep. & maj. vot.; Voting Reint. Point

TFNP
  rx TM: node rep. & maj. vot.; YAA watchdog; reset; TM resync; Voting Reint. Point
  rx/tx cc-vec./ACK/SP: node rep. & maj. vot.; diagnosis (CEC); reset; TM resync; Voting Reint. Point
  sensor acquisition: node rep. & maj. vot.; diagnosis (DEC); reset; TM resync; Voting Reint. Point
  actuator/control calculation: node rep. & maj. vot.; diagnosis (DEC); reset; TM resync; Voting Reint. Point
  majority voting: node rep. & maj. vot.; diagnosis (DEC); reset; TM resync; Voting Reint. Point

PFN
  rx TM: node rep. & maj. vot.
  rx/tx cc-vec./ACK/SP: node rep. & maj. vot.; degraded mode diagnosis; degraded mode notification
  sensor acquisition: node rep. & maj. vot.; degraded mode diagnosis; degraded mode notification
  actuator/control calculation: node rep. & maj. vot.; degraded mode diagnosis; degraded mode notification
  majority voting: node rep. & maj. vot.; degraded mode diagnosis; degraded mode notification
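
The poster names a Communication Error Counter and a Discrepancy Error Counter as diagnosis mechanisms that lead to a reset, but it does not give their update rule. The sketch below is therefore only one plausible policy, stated as an assumption: count consecutive erroneous cycles, clear the counter on a correct cycle, and request a reset when a hypothetical `threshold` is reached.

```python
class ErrorCounter:
    """Counts consecutive erroneous cycles (assumed policy, not from the poster)."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical tuning parameter
        self.count = 0

    def record(self, error):
        """Update with one cycle's outcome; return True when a reset is due."""
        if error:
            self.count += 1
        else:
            self.count = 0
        return self.count >= self.threshold

# Example: a discrepancy counter fed with per-cycle outcomes, where True
# means the replica's own value differed from the voted value.
dec = ErrorCounter(threshold=3)
outcomes = [True, False, True, True, True]
decisions = [dec.record(o) for o in outcomes]
print(decisions)
```

With this policy an isolated discrepancy is tolerated, while a run of discrepancies is treated as a temporary fault manifesting as permanent and triggers the reset and reintegration path.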

Acknowledgements

Supported by DPI2011-22992 and TEC2015-70313-R (Spanish Ministerio de Economía y Competitividad), by FEDER funding and by the EUROWEB Project funded by the Erasmus Mundus Action II programme of the European Commission.

Message Status (MS) Vector (matrix view)

[figure: matrix of T/F entries for replica 1, replica 2, replica 3, ..., replica N; labels: cc-vector by replica 1, acknow. by replica 2, acknow. by replica 3, acknow. by replica N; attempt 1, attempt 2, …, attempt K; VCR; vote on the exchanged values]
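
The matrix view of the Message Status vector can be illustrated as a table of true/false flags filled in over up to K attempts. The layout and meaning used here are assumptions for the sake of the example: entry `ms[sender][receiver]` is taken to mean that the receiver acknowledged the sender's message in some attempt, and `delivered` is a hypothetical callback that stands in for the network.

```python
def build_ms_matrix(delivered, n_replicas, max_attempts):
    """Fill a T/F matrix; delivered(sender, receiver, attempt) is hypothetical."""
    ms = [[False] * n_replicas for _ in range(n_replicas)]
    for sender in range(n_replicas):
        for receiver in range(n_replicas):
            if sender == receiver:
                ms[sender][receiver] = True  # a replica trivially has its own message
                continue
            for attempt in range(1, max_attempts + 1):
                if delivered(sender, receiver, attempt):
                    ms[sender][receiver] = True
                    break  # acknowledged, no further retransmission needed
    return ms

def suspects(ms):
    """Replicas whose messages reached no other replica in any attempt."""
    n = len(ms)
    return [s for s in range(n) if not any(ms[s][r] for r in range(n) if r != s)]

# Example: replica index 2 never gets a message through; the others
# succeed on their second attempt.
ms = build_ms_matrix(lambda s, r, a: s != 2 and a >= 2, n_replicas=3, max_attempts=3)
for row in ms:
    print(["T" if flag else "F" for flag in row])
print(suspects(ms))
```

Retransmitting up to K times lets short communication faults be masked within the cycle, and the remaining F entries are the evidence on which a communication fault diagnosis can be based.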