SLIDE 1
Exposing Design Flaws in Shared-Clock Systems using TLA+
Russell Mull, Auxon Corporation
TLA+ Conf, September 12, 2019
SLIDE 2
About Me
Russell Mull
Software Engineer, Auxon Corporation
SLIDE 7
How safe electronic systems are designed
- Decide what matters (safety requirements)
- Decide how much it matters (Assign a Safety Integrity Level - SIL)
- Analyze the parts of the system that matter (Fault Tree Analysis)
- Not good enough? Add redundancy.
SLIDE 12
Example: Industrial Press
- Safety requirement: Turn off press with emergency stop button
- SIL: 4
- Fault tree: the actuator is only SIL 3
- Redundancies: use two, design a SIL 4 failover mechanism
SLIDE 13
Functional Safety
- IEC 61508
- Power plants, chemical plants, cars, trains, heavy machinery, etc.
SLIDE 14
This works well, until…
SLIDE 18
In software, shared clock failures are lumpy and unpredictable
SLIDE 19
The story of a system made from lots of computers, sensors, actuators, and clocks
SLIDE 23
A client project for ██████████
- Can’t say anything specific
- Relies fundamentally on a common timebase
- Appeared to be vulnerable to drift
SLIDE 24
My Goal: Demonstrate the problem
SLIDE 31
A naïve model
VARIABLES node_clock, system
Nodes == { "A", "B", "C" }
Init ==
  /\ node_clock = [ n \in Nodes |-> 0 ]
  /\ system = ...
Next ==
  \/ \E node \in DOMAIN node_clock:
       /\ node_clock' = [node_clock EXCEPT ![node] = @ + 1]
       /\ UNCHANGED << system >>
  \/ /\ SystemStep(system)
     /\ UNCHANGED << node_clock >>
  \/ SyncClocks(node_clock, system)
SystemStep(s) == ...
SyncClocks(cs,s) == ...
SLIDE 34
This approach is not great.
- Massive state explosion
- Customer doesn’t care about the sync protocol
SLIDE 35
Model the drift, not the sync
SLIDE 40
Drift Modeling (1)
CONSTANTS SIMULATED_CYCLES, BOUNDED_DRIFT
VARIABLES global_clock, node_clock, system
Nodes == { "A", "B", "C" }
Init ==
  /\ global_clock = 0
  /\ node_clock = [ n \in Nodes |-> 0 ]
Next ==
  \/ ClockStep /\ UNCHANGED system
  \/ SystemStep /\ UNCHANGED global_clock /\ UNCHANGED node_clock
SystemStep == ...
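The slide shows only the core definitions. To check them with TLC, the module would also need a header, the standard arithmetic operators, and a temporal formula. A minimal sketch of that scaffolding follows. The module name DriftModel and the names vars and Spec are assumptions rather than anything shown in the talk, and the ASSUME line only records the expectation that both constants are natural numbers.

```tla
---- MODULE DriftModel ----
\* Hypothetical scaffolding; only the CONSTANTS and VARIABLES lines
\* are taken from the slide.
EXTENDS Integers   \* provides +, -, <= and Nat

CONSTANTS SIMULATED_CYCLES, BOUNDED_DRIFT
VARIABLES global_clock, node_clock, system

\* Assumed: both constants are natural numbers.
ASSUME /\ SIMULATED_CYCLES \in Nat
       /\ BOUNDED_DRIFT \in Nat

\* ... Nodes, Init, ClockStep, SystemStep and Next as on the slides ...

vars == << global_clock, node_clock, system >>

\* Conventional form: every step is a Next step or a stuttering step.
Spec == Init /\ [][Next]_vars
====
```

In a real module the operators used by Next (ClockStep, SystemStep) would have to be defined above it, since TLA+ requires definitions to precede their use.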
SLIDE 43
Drift Modeling (2)
ClockStep ==
  \* Tick the global clock
  \/ /\ global_clock' = global_clock + 1
     /\ UNCHANGED << node_clock >>
     /\ ClockDriftInBounds(global_clock', node_clock)
  \* Tick a node clock
  \/ \E node \in DOMAIN node_clock:
       /\ node_clock' = [node_clock EXCEPT ![node] = @ + 1]
       /\ UNCHANGED << global_clock >>
       /\ ClockDriftInBounds(global_clock, node_clock')
SLIDE 44
Drift Modeling (3)
ClockDriftInBounds(g, n) ==
  /\ g <= SIMULATED_CYCLES
  /\ \A node \in DOMAIN n :
       /\ n[node] <= SIMULATED_CYCLES
       /\ Abs(n[node] - g) <= BOUNDED_DRIFT
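ClockDriftInBounds relies on an Abs operator that the slides do not define, and the standard Integers module does not provide one. A minimal sketch of a definition follows, together with an invariant that TLC could check to confirm that every reachable state respects the bound. The name DriftInvariant is an assumption, and the fragment assumes the enclosing module extends Integers.

```tla
\* Absolute value; assumed definition, not shown in the talk.
Abs(x) == IF x < 0 THEN -x ELSE x

\* Hypothetical invariant: no node clock strays more than BOUNDED_DRIFT
\* from the global clock in any reachable state.
DriftInvariant ==
  \A node \in DOMAIN node_clock :
    Abs(node_clock[node] - global_clock) <= BOUNDED_DRIFT
```

Because ClockStep only permits ticks whose resulting state satisfies ClockDriftInBounds, this invariant should hold by construction. Checking it is mainly a guard against typos in the model.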
SLIDE 47
This works better
- Narrower state space
- Directly addresses relevant failure domain
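One way to see where the narrowing comes from is to count clock valuations only, ignoring the system variable. This is a back-of-the-envelope bound, not a figure from the talk. Write T for SIMULATED_CYCLES, B for BOUNDED_DRIFT, and N for the number of nodes. Each clock takes a value in 0..T, and under the drift bound each node clock has at most 2B + 1 admissible values for a given global clock value.

```latex
\begin{align*}
\text{capped at } T \text{, no drift bound:}\quad & (T+1)^{N+1} \\
\text{capped at } T \text{, drift bounded by } B\text{:}\quad & \le (T+1)\,(2B+1)^{N} \\
T = 10,\ B = 1,\ N = 3\text{:}\quad & 11^{4} = 14641 \quad\text{versus}\quad \le 11 \cdot 27 = 297
\end{align*}
```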
SLIDE 48
The system was more vulnerable to drift than previously thought
SLIDE 49
Delivering a Model
- Literate PDF
- Makefile / .cfg file
- Config Instructions
SLIDE 50
TLA+ is tricky to use this way
- Difficult setup
- Easier development
- Easier delivery
SLIDE 51
Give models to your customers
SLIDE 52
Extending the technique
- Asymmetric Drift
- Action on Tick
- Cyclical Clock
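The slide lists these extensions without detail. As a rough illustration only, the first and third might be expressed as small variations on ClockDriftInBounds and on the clock tick. Everything here is an assumption rather than the speaker's model: the constants MAX_AHEAD, MAX_BEHIND and CLOCK_PERIOD and both operator names are hypothetical, and the fragment assumes the enclosing module extends Integers.

```tla
\* Hypothetical constants, not from the talk.
CONSTANTS MAX_AHEAD, MAX_BEHIND, CLOCK_PERIOD

\* Asymmetric drift: a node clock may run ahead of the global clock by
\* at most MAX_AHEAD and lag behind it by at most MAX_BEHIND.
AsymmetricDriftInBounds(g, n) ==
  /\ g <= SIMULATED_CYCLES
  /\ \A node \in DOMAIN n :
       /\ n[node] <= SIMULATED_CYCLES
       /\ n[node] - g <= MAX_AHEAD
       /\ g - n[node] <= MAX_BEHIND

\* Cyclical clock: the counter wraps to 0 after CLOCK_PERIOD - 1.
CyclicTick(c) == (c + 1) % CLOCK_PERIOD
```

With a wrapping counter, the drift check would also need a modular distance between clocks, which this sketch leaves out. "Action on Tick" is not sketched because the slide gives no indication of what the action is.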
SLIDE 53
Closing Thoughts
SLIDE 54
Closing Thoughts
- Fake a real clock
SLIDE 55
Closing Thoughts
- Fake a real clock
- Bound the drift
SLIDE 56
Closing Thoughts
- Fake a real clock
- Bound the drift
- Give models to your customers
SLIDE 57
Closing Thoughts
- Fake a real clock
- Bound the drift
- Give models to your customers
- I owe Hillel Wayne a great debt
SLIDE 58