
slide-1
SLIDE 1

Designing fault-diagnosis and reintegration to prevent node redundancy attrition in highly reliable control systems based on FTT-Ethernet

Sinisa Derasevic, Manuel Barranco, Julián Proenza

Mathematics and Computer Science Department, University of the Balearic Islands (UIB), Spain

1

slide-2
SLIDE 2

diagnosis and reintegration of faulty nodes in highly reliable Distributed Control Systems based on FTT-Ethernet

2

[Figure: nodes 1, 2, 3, ..., M connected to a switch]

slide-3
SLIDE 3

3

[Figure: nodes 1, 2, 3, ..., M connected to a switch]

diagnosis and reintegration of faulty nodes in highly reliable Distributed Control Systems based on FTT-Ethernet

relevant piece of FT4FTT

slide-4
SLIDE 4

4

[Figure: nodes 1, 2, 3, ..., M connected to a leader switch and a follower switch]

  • high reliability by tolerating faults at
      • switch → duplicate
      • links → duplicate
      • nodes

slide-5
SLIDE 5

5

  • high reliability by tolerating faults at
      • switch → duplicate
      • links → duplicate
      • nodes → actively replicate critical nodes & vote

[Figure: nodes 1, 2, 3, ..., M connected to a leader switch and a follower switch]

slide-6
SLIDE 6

6

which are the critical nodes?

slide-7
SLIDE 7

7

[Figure: plant; controller (C), sensor (S) and actuation (A) nodes, ..., node M]

which are the critical nodes?

slide-8
SLIDE 8

8

[Figure: plant; controller (C), sensor (S) and actuation (A) nodes, ..., node M]

which are the critical nodes?

system failure

in principle all these nodes can be considered as critical

slide-9
SLIDE 9

9

[Figure: plant; controller (C) node, ..., node M; sensor (S) and actuation (A) nodes duplicated]

which are the critical nodes?

replicating sensor and actuation nodes is trivial

slide-10
SLIDE 10

10

which are the critical nodes?

replicating a controller node is complex: the replicas must coordinate among themselves

[Figure: plant with sensor(s) and actuator(s); controller replicas 1, 2, ..., N coordinating among themselves; further nodes up to node M]

slide-11
SLIDE 11

11

how do replicas coordinate?

  • synchronize at communication & app. levels
  • using the Trigger Message (TM)
  • vote on intermediate results

slide-12
SLIDE 12

12

how do replicas coordinate?

  • synchronize at communication & app. levels
  • using the Trigger Message (TM)
  • vote on intermediate results 
slide-13
SLIDE 13

13

voting

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; app control cycle: sense, control, actuate]

slide-14
SLIDE 14

14

voting

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; app control cycle: sense, control, actuate]

acquire sensors: A A B

slide-15
SLIDE 15

15

voting

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch]

acquire sensors: A A B
exchange sensors: A A B

slide-16
SLIDE 16

16

voting

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; each replica votes]

acquire sensors: A A B
exchange sensors: each replica holds A A B
vote on sensors: each replica votes on A A B

slide-17
SLIDE 17

17

voting

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch]

acquire sensors: A A B
exchange sensors: each replica holds A A B
vote on sensors: each replica votes on A A B
consensus: A A A

slide-18
SLIDE 18

18

voting

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; app control cycle: sense, control, actuate]

consensus: A A A
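
The acquire, exchange and vote sequence shown on the voting slides can be illustrated with a minimal sketch. It assumes a strict-majority rule and models the exchange as every replica receiving every reading; the names `majority_vote`, `acquired`, `exchanged` and `consensus` are hypothetical and do not come from FT4FTT.

```python
from collections import Counter


def majority_vote(values):
    """Return the value held by a strict majority of the inputs, or None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


# Acquire: each replica reads its own sensor value; one replica reads "B".
acquired = {"replica 1": "A", "replica 2": "A", "replica 3": "B"}

# Exchange: afterwards every replica holds the same vector of readings.
exchanged = {name: list(acquired.values()) for name in acquired}

# Vote: each replica votes locally on its copy of the vector.
consensus = {name: majority_vote(vector) for name, vector in exchanged.items()}

print(consensus)
# {'replica 1': 'A', 'replica 2': 'A', 'replica 3': 'A'}
```

Because every replica votes on the same vector, all of them obtain the same value, including the replica whose own reading differed.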

slide-19
SLIDE 19

19

benefits of active node replication with voting?

slide-20
SLIDE 20

20

compensate errors

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; labels: ✔, e]

the system can correctly deliver its service

slide-21
SLIDE 21

21

replicas may recover from errors

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; labels: ✔, e, temporary]

if replica 3 can vote, replica 3 recovers and keeps participating

slide-22
SLIDE 22

22

however…

slide-23
SLIDE 23

23

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch]

what if a temporary fault causes a replica to be lost from then on?

slide-24
SLIDE 24

24

what if a temporary fault causes a replica to be lost from then on?

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch]

a temporary fault affects replica 3's internals or communication capabilities

slide-25
SLIDE 25

25

what if a temporary fault causes a replica to be lost from then on?

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; labels: ? ?]

a temporary fault affects replica 3's internals or communication capabilities

replica 3 may desynchronize at the level of application and/or communication

slide-26
SLIDE 26

26

what if a temporary fault causes a replica to be lost from then on?

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; labels: ? ?]

a temporary fault affects replica 3's internals or communication capabilities

replica 3 may desynchronize at the level of application and/or communication

I cannot recover!

slide-27
SLIDE 27

27

[Figure: replicas 1, 2 and 3 connected to the leader switch and the follower switch; labels: ? ?, ×]

node redundancy attrition

I cannot recover!

replica 3 is not permanently faulty, but cannot be used!

slide-28
SLIDE 28

28

temporary faults are more probable than permanent ones

slide-29
SLIDE 29

29

if we do not prevent redundancy attrition caused by temporary faults

slide-30
SLIDE 30

30

then we do not take full advantage of the redundancy investment
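
A small worked example may make the cost of attrition concrete. It assumes plain strict-majority voting among the replicas that still participate, in which case n replicas can outvote at most (n - 1) // 2 faulty ones; the function name `tolerated_faulty_replicas` is hypothetical.

```python
def tolerated_faulty_replicas(n_replicas):
    """Largest number of faulty replicas a strict majority vote can outvote."""
    return (n_replicas - 1) // 2


for n in (3, 2):
    print(n, "replicas ->", tolerated_faulty_replicas(n), "faulty replica(s) tolerated")
# 3 replicas -> 1 faulty replica(s) tolerated
# 2 replicas -> 0 faulty replica(s) tolerated
```

Under this assumption, once a three-replica controller loses one replica to a temporary fault and never reintegrates it, the remaining pair can no longer mask any further faulty replica.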

slide-31
SLIDE 31

31

objective

prevent node redundancy attrition

slide-32
SLIDE 32

32

objective

identify and implement mechanisms to diagnose and reintegrate nodes that are lost because of temporary faults

slide-33
SLIDE 33

33

steps

  • classify faults
  • exhaustively analyze how they can affect a replica
  • design needed mechanisms
  • implement and test them

slide-34
SLIDE 34

34

steps

  • classify faults
  • exhaustively analyze how they can affect a replica
  • design needed mechanisms
  • implement and test them → pending

slide-35
SLIDE 35

35

we plan to quantify the reliability improvement

slide-36
SLIDE 36

36

thank you for your attention !!

System Architecture

  • DECS
  • FT4FTT
  • Node Replication
  • Link Replication
  • Switch Duplication
  • Control Application

Designing fault-diagnosis and reintegration to prevent node redundancy attrition in highly reliable control systems based on FTT-Ethernet

Sinisa Derasevic, Manuel Barranco, Julián Proenza
DMI, Universitat de les Illes Balears, Spain
sinishadj@gmail.com, manuel.barranco@uib.es, julian.proenza@uib.es

Abstract

Distributed Embedded Control Systems (DECSs) used for Real-Time (RT) critical applications must satisfy stringent time requirements and attain high reliability. FTT-Ethernet provides nodes of DECSs with real-time communication capabilities, but does not include Fault Tolerance (FT) mechanisms. The FT4FTT project aims at proposing a complete FT architecture for RT critical DECSs. It uses a duplicated switched FTT-Ethernet star and active node replication with consistent distributed majority voting to respectively tolerate channel and node faults. However, FT4FTT, in its current state, still lacks mechanisms to prevent node redundancy attrition due to temporary faults affecting the nodes and channel, which are the most likely types of faults in DESs. This paper presents our ongoing work to complete the FT4FTT architecture with appropriate fault-diagnosis and reintegration mechanisms that overcome this limitation.

[Figure: nodes 1, 2, 3, ..., M; leader switch; controller replicas 1, 2, ..., N; plant with actuator(s) and sensor(s)]

Extended Control Application Cycle to support Fault Tolerance, Diagnosis and Reintegration

  • Distributed Consistent Majority Voting (DCMV)
  • Segments (NVP paradigm)
  • Error compensation
  • Replica determinism
  • Control application phases in FT4FTT
      • Sense (S)
      • Exchange Sensor Values (ESV)
      • Vote on Sensor values (VS)
      • Control (C)
      • Exchange Actuation Values (EAV)
      • Vote on Actuation values (VA)
      • Actuate (A)
  • Exchange also Set Point (SP) & Status of control → seamless reintegration
  • VCR to reliably vote in a consistent manner
  • CVEP: retransmissions of cc-vectors and ACKs
  • MS-vector to diagnose communication faults

[Figure: control application cycle between TMi and TMi+1 with phases S, ESV, VS, C, EAV, VA, A; attempts 1, 2, ..., K within the VCR; exchange: Sensor, Set Point & Status; exchange: Actuation, Set Point & Status; vote on the exchanged values; calculate: Actuation, Status; segment 1, segment 2]
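
The phase sequence S, ESV, VS, C, EAV, VA, A can be sketched as one function that runs a single cycle for all replicas. This is only an illustration under stated assumptions: the exchange is modelled as every replica seeing every value, voting is a strict majority, and `control_cycle`, `vote`, `p_control` and `faulty` are hypothetical names, not part of FT4FTT. The Set Point and Status exchange, VCR and CVEP are left out.

```python
from collections import Counter


def vote(values):
    """Strict majority vote; returns None when no majority exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def control_cycle(readings, set_point, control_laws):
    """Run one cycle for all replicas; return the voted sensor and actuation values."""
    # S: `readings` holds the value sensed by each replica.
    # ESV: after the exchange every replica holds the full list of readings.
    # VS: every replica votes on that list and obtains the same sensor value.
    sensor = vote(readings)
    # C: each replica computes its own actuation value from the voted input.
    actuations = [law(set_point, sensor) for law in control_laws]
    # EAV + VA: actuation values are exchanged and voted in the same way.
    actuation = vote(actuations)
    # A: the voted actuation value is the one applied to the plant.
    return sensor, actuation


def p_control(sp, y):
    """Hypothetical proportional control law."""
    return 2 * (sp - y)


def faulty(sp, y):
    """Hypothetical replica whose calculation is corrupted."""
    return 0


print(control_cycle([20, 20, 23], 25, [p_control, faulty, p_control]))
# (20, 10)
```

In this run one replica senses a deviating value and another miscalculates its output; each error is compensated by the vote that follows it, so the applied actuation value is the one computed by the correct replicas.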

Analysis of Fault Tolerance, Diagnosis & Reintegration Mechanisms

Fault Classification
  • Temporary (T)
  • Long Lasting temporary (LL)
  • Permanent (P)
  • Temp. manifesting as Perm. (T…P)
  • Fau. affecting Link (FL)
  • Fau. affecting Node rep. (FN)

Fault Diagnosis & Reint. mechanisms
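  • TM resynchronization
  • TM Seq. Num. (TMSQ)
  • TM Seq. Num. Count. (TMSQC)
  • Voting Reintegration Point
  • Communication Error Counter
  • Discrepancy Error Counter
  • You Are Alive (YAA) watchdog

[Table: fault class (TFL, LLFL, PFL, TFN, TFNP, PFN) versus operation (rx TM, rx/tx cc-vec./ACK/SP, sensor acquisition, actuator/control calculation, majority voting)]

TFL
  • rx TM: TM replication
  • rx/tx cc-vec./ACK/SP: CVEP
  • sensor acquisition: x
  • actuator/control calculation: x
  • majority voting: x

LLFL
  • rx TM: node rep. & maj. vot.; TM resync; Voting Reint. Point
  • rx/tx cc-vec./ACK/SP: node rep. & maj. vot.; Voting Reint. Point
  • sensor acquisition: x
  • actuator/control calculation: x
  • majority voting: x

PFL
  • rx TM: link replication
  • rx/tx cc-vec./ACK/SP: link replication
  • sensor acquisition: x
  • actuator/control calculation: x
  • majority voting: x

TFN
  • rx TM: TM replication; node rep. & maj. vot.; TM resync; Voting Reint. Point
  • rx/tx cc-vec./ACK/SP: CVEP; node rep. & maj. vot.; Voting Reint. Point
  • sensor acquisition: node rep. & maj. vot.; Voting Reint. Point
  • actuator/control calculation: node rep. & maj. vot.; Voting Reint. Point
  • majority voting: node rep. & maj. vot.; Voting Reint. Point

TFNP
  • rx TM: node rep. & maj. vot.; YAA watchdog; reset; TM resync; Voting Reint. Point
  • rx/tx cc-vec./ACK/SP: node rep. & maj. vot.; diagnosis (CEC); reset; TM resync; Voting Reint. Point
  • sensor acquisition: node rep. & maj. vot.; diagnosis (DEC); reset; TM resync; Voting Reint. Point
  • actuator/control calculation: node rep. & maj. vot.; diagnosis (DEC); reset; TM resync; Voting Reint. Point
  • majority voting: node rep. & maj. vot.; diagnosis (DEC); reset; TM resync; Voting Reint. Point

PFN
  • rx TM: node rep. & maj. vot.
  • rx/tx cc-vec./ACK/SP: node rep. & maj. vot.; degraded mode diagnosis; degraded mode notification
  • sensor acquisition: node rep. & maj. vot.; degraded mode diagnosis; degraded mode notification
  • actuator/control calculation: node rep. & maj. vot.; degraded mode diagnosis; degraded mode notification
  • majority voting: node rep. & maj. vot.; degraded mode diagnosis; degraded mode notification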
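
The poster names a Discrepancy Error Counter (DEC) and pairs "diagnosis(DEC)" with a reset, but it does not say how the counter is updated. The sketch below is therefore only one plausible reading, with every detail assumed: the counter tracks consecutive voting rounds in which a replica's own value differs from the voted one, agreement clears it, and reaching a threshold requests a reset. The class name, the threshold of 3 and the clearing rule are all hypothetical.

```python
class DiscrepancyErrorCounter:
    """Hypothetical counter of consecutive disagreements with the voted value."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed value, not taken from the poster
        self.count = 0

    def update(self, own_value, voted_value):
        """Record one voting round; return True when a reset should be requested."""
        if own_value == voted_value:
            self.count = 0  # assumption: agreement clears the counter
        else:
            self.count += 1
        return self.count >= self.threshold


dec = DiscrepancyErrorCounter(threshold=3)
rounds = [("B", "A"), ("A", "A"), ("B", "A"), ("B", "A"), ("B", "A")]
print([dec.update(own, voted) for own, voted in rounds])
# [False, False, False, False, True]
```

With this rule an isolated discrepancy is simply outvoted, while a persistent one is treated as a temporary fault manifesting as permanent and leads to the reset, TM resynchronization and Voting Reintegration Point sequence listed for TFNP.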

Acknowledgements

Supported by DPI2011-22992 and TEC2015-70313-R (Spanish Ministerio de economía y competitividad), by FEDER funding and by the EUROWEB Project funded by the Erasmus Mundus Action II programme of the European Commission.

[Figure: Message Status (MS) Vector (matrix view): T/F entries for replicas 1, 2, 3, ..., N; cc-vector by replica 1; acknow. by replica 2, replica 3, ..., replica N; attempts 1, 2, ..., K; VCR; vote on the exchanged values; follower switch]