Designing fault-diagnosis and reintegration to prevent node redundancy attrition in highly reliable control systems based on FTT-Ethernet
Sinisa Derasevic, Manuel Barranco, Julián Proenza
Mathematics and Computer Science Department, University of the Balearic Islands (UIB), Spain
diagnosis and reintegration of faulty nodes in highly reliable Distributed Control Systems based on FTT-Ethernet
• relevant piece of FT4FTT
[diagram: nodes 1, 2, 3, ..., M connected to a switch]
• high reliability by tolerating faults at
o switch: duplicate
o links: duplicate
o nodes: actively replicate critical nodes & vote
[diagram: nodes 1, 2, 3, ..., M connected to a leader switch and a follower switch]
which are the critical nodes?
• in principle all these nodes (sensor, controller, actuation) can be considered as critical: a failure in any of them is a system failure
[diagram: plant with sensors (S) and actuators (A), attached to sensor, controller (C), and actuation nodes 1, ..., M]
• replicating sensor and actuation nodes is trivial
[diagram: plant with replicated sensor (S) and actuation (A) nodes and a single controller node]
• replicating a controller node is complex: the replicas must coordinate among themselves
[diagram: plant with sensor(s) and actuator(s), and controller replicas 1, 2, ..., N that coordinate among themselves]
how do replicas coordinate?
• synchronize at communication & app. levels
o using the Trigger Message (TM)
• vote on intermediate results
voting
app control cycle: sense, control, actuate
• acquire sensors: replicas 1, 2, and 3 read A, A, and B
• exchange sensors: every replica now holds A, A, B
• vote on sensors: every replica votes on A, A, B
• consensus: replicas 1, 2, and 3 all obtain A
[diagram: replicas 1, 2, 3 connected to a leader switch and a follower switch]
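The acquire, exchange, vote, and consensus steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the slides do not say which voting function is used, so a strict-majority vote is assumed here, and the names `majority_vote`, `acquired`, `exchanged`, and `consensus` are hypothetical.

```python
from collections import Counter


def majority_vote(values):
    """Return the value held by a strict majority of replicas, or None.

    Assumption: strict-majority voting; the slides do not name the voter.
    """
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


# Acquire: each replica reads the sensors; replica 3 obtains a deviating value.
acquired = {1: "A", 2: "A", 3: "B"}

# Exchange: after the exchange every replica holds the same list of readings.
exchanged = {replica: list(acquired.values()) for replica in acquired}

# Vote: every replica runs the same deterministic vote on the same inputs,
# so all replicas end up with the same result (consensus).
consensus = {replica: majority_vote(values) for replica, values in exchanged.items()}

print(consensus)
```

Because every replica votes on the same exchanged set, replica 3 also ends up with the majority value rather than its own deviating reading.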
benefits of active node replication with voting?
✔ compensate errors: the system can correctly deliver its service
[diagram: an error (e) in one of replicas 1, 2, 3; leader and follower switches]
✔ replicas may recover from errors: if replica 3 can vote after a temporary error (e), it recovers and keeps participating
[diagram: replicas 1, 2, 3 connected to a leader switch and a follower switch]
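A minimal sketch of this recovery effect, assuming that each replica overwrites its intermediate result with the voted value at the end of every control cycle. The `Replica` class, the accumulating control law, and the corrupted value 99 are hypothetical and chosen only to make the effect visible.

```python
from collections import Counter


def vote(values):
    """Strict-majority vote (an assumption; the slides do not name the voter)."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


class Replica:
    """Hypothetical controller replica with state carried across cycles."""

    def __init__(self, name):
        self.name = name
        self.state = 0  # intermediate result kept between control cycles

    def compute(self, sensor_value):
        # Hypothetical control law: accumulate the sensor readings.
        return self.state + sensor_value

    def commit(self, voted):
        # Adopting the voted value is what lets a corrupted replica recover.
        self.state = voted


replicas = [Replica(n) for n in (1, 2, 3)]
replicas[2].state = 99  # a temporary error corrupts replica 3's state

history = []
for sensor_value in (5, 7):
    proposals = [r.compute(sensor_value) for r in replicas]
    history.append(proposals)
    voted = vote(proposals)
    for r in replicas:
        r.commit(voted)

print(history)
```

In the first cycle replica 3 proposes a deviating result and is outvoted; in the second cycle it proposes the same value as the others, because it committed the voted result.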
however …
what if a temporary fault causes a replica to be lost from then on?
• a temporary fault affects replica 3's internals or communication capabilities
• replica 3 may desynchronize at the level of application and/or communication
• replica 3: "I cannot recover!"
[diagram: replicas 1, 2, 3 connected to a leader switch and a follower switch]
node redundancy attrition
• replica 3 is not permanently faulty, but it cannot be used!
[diagram: replica 3 crossed out (×); replicas 1 and 2 remain connected to the leader and follower switches]
temporary faults are more probable than permanent ones
if we do not prevent redundancy attrition caused by temporary faults,
then we do not take full advantage of the redundancy investment
objective: prevent node redundancy attrition
• identify and implement mechanisms to diagnose and reintegrate temporarily faulty nodes that are lost
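One conceivable shape for the diagnosis part is sketched below. It is not taken from the slides, which do not describe the mechanisms themselves. It assumes that a lost replica shows up as a replica that stops taking part in the vote, and it flags a replica after a configurable number of consecutive missed votes. `AttritionMonitor`, `miss_threshold`, and `end_of_cycle` are hypothetical names.

```python
class AttritionMonitor:
    """Hypothetical monitor that flags replicas missing from consecutive votes."""

    def __init__(self, replica_ids, miss_threshold=3):
        self.miss_threshold = miss_threshold
        self.misses = {rid: 0 for rid in replica_ids}

    def end_of_cycle(self, voters):
        """Record which replicas voted this cycle; return the suspected-lost ones."""
        for rid in self.misses:
            # A replica that voted is considered healthy again (counter reset).
            self.misses[rid] = 0 if rid in voters else self.misses[rid] + 1
        return sorted(
            rid for rid, missed in self.misses.items()
            if missed >= self.miss_threshold
        )


monitor = AttritionMonitor([1, 2, 3], miss_threshold=2)

# Replica 3 misses two consecutive votes, then takes part again.
voters_per_cycle = ({1, 2, 3}, {1, 2}, {1, 2}, {1, 2, 3})
suspects_per_cycle = [monitor.end_of_cycle(v) for v in voters_per_cycle]

print(suspects_per_cycle)
```

A flagged replica would then be a candidate for reintegration; how reintegration is actually carried out is left open here, as it is in the slides.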
steps
• classify faults
• exhaustively analyze how they can affect a replica
• design needed mechanisms
• implement and test them (pending)
we plan to quantify the reliability improvement