Exposing Design Flaws in Shared-Clock Systems using TLA+

slide-1
SLIDE 1

Exposing Design Flaws in Shared-Clock Systems using TLA+

Russell Mull, Auxon Corporation
TLA+ Conf, September 12, 2019

slide-2
SLIDE 2

About Me

Russell Mull
Software Engineer, Auxon Corporation

slide-7
SLIDE 7

How safe electronic systems are designed

  • Decide what matters (safety requirements)
  • Decide how much it matters (Assign a Safety Integrity Level - SIL)
  • Analyze the parts of the system that matter (Fault Tree Analysis)
  • Not good enough? Add redundancy.
slide-12
SLIDE 12

Example: Industrial Press

  • Safety requirement: Turn off press with emergency stop button
  • SIL: 4
  • Fault tree: the actuator is only SIL 3
  • Redundancies: use two, design a SIL 4 failover mechanism
slide-13
SLIDE 13

Functional Safety

  • IEC 61508
  • Power plants, chemical plants, cars, trains, heavy machinery, etc.
slide-14
SLIDE 14

This works well, until…

slide-18
SLIDE 18

In software, shared clock failures are lumpy and unpredictable

slide-19
SLIDE 19

The story of a system made from lots of computers, sensors, actuators, and clocks

slide-23
SLIDE 23

A client project for ██████████

  • Can’t say anything specific
  • Relies fundamentally on a common timebase
  • Appeared to be vulnerable to drift
slide-24
SLIDE 24

My Goal: Demonstrate the problem

slide-31
SLIDE 31

A naïve model

VARIABLES node_clock, system

Nodes == { "A", "B", "C" }

Init ==
    /\ node_clock = [ n \in Nodes |-> 0 ]
    /\ system = ...

Next ==
    \* Any one node's clock ticks on its own
    \/ \E node \in DOMAIN node_clock:
        /\ node_clock' = [node_clock EXCEPT ![node] = @ + 1]
        /\ UNCHANGED << system >>
    \* The system under study takes a step
    \/ /\ SystemStep(system)
       /\ UNCHANGED << node_clock >>
    \* The clock synchronization protocol runs
    \/ SyncClocks(node_clock, system)

SystemStep(s) == ...
SyncClocks(cs, s) == ...

slide-34
SLIDE 34

This approach is not great.

  • Massive state explosion
  • Customer doesn’t care about the sync protocol
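To get a feel for the state explosion, here is a minimal Python sketch (not the talk's model, and not TLA+) that enumerates only the node_clock valuations of the naive model by breadth-first search. The max_tick cap is an assumption added so the search terminates; the system variable and the sync action are left out, so real counts would be larger still.

```python
from collections import deque

NODES = ("A", "B", "C")


def naive_reachable(max_tick):
    """Enumerate node-clock valuations when any node may tick freely.

    max_tick is an artificial cap (an assumption of this sketch) so that
    the search is finite; the naive model itself has no such bound.
    """
    start = (0,) * len(NODES)
    seen = {start}
    frontier = deque([start])
    while frontier:
        clocks = frontier.popleft()
        for i in range(len(NODES)):
            if clocks[i] < max_tick:
                # Tick exactly one node clock, as in the first disjunct of Next.
                nxt = clocks[:i] + (clocks[i] + 1,) + clocks[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


for cap in (2, 4, 8):
    print(cap, len(naive_reachable(cap)))
```

With three independent clocks the count is (max_tick + 1)^3, and that is before multiplying by the states of system and of the sync protocol.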
slide-35
SLIDE 35

Model the drift, not the sync

slide-40
SLIDE 40

Drift Modeling (1)

CONSTANTS SIMULATED_CYCLES, BOUNDED_DRIFT
VARIABLES global_clock, node_clock, system

Nodes == { "A", "B", "C" }

Init ==
    /\ global_clock = 0
    /\ node_clock = [ n \in Nodes |-> 0 ]

Next ==
    \/ /\ ClockStep
       /\ UNCHANGED system
    \/ /\ SystemStep
       /\ UNCHANGED << global_clock, node_clock >>

SystemStep == ...

slide-43
SLIDE 43

Drift Modeling (2)

ClockStep ==
    \* Tick the global clock
    \/ /\ global_clock' = global_clock + 1
       /\ UNCHANGED << node_clock >>
       /\ ClockDriftInBounds(global_clock', node_clock)
    \* Tick a node clock
    \/ \E node \in DOMAIN node_clock:
        /\ node_clock' = [node_clock EXCEPT ![node] = @ + 1]
        /\ UNCHANGED << global_clock >>
        /\ ClockDriftInBounds(global_clock, node_clock')

slide-44
SLIDE 44

Drift Modeling (3)

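\* Abs is not defined on the slide; a helper like this one is assumed.
Abs(x) == IF x < 0 THEN -x ELSE x

ClockDriftInBounds(g, n) ==
    /\ g <= SIMULATED_CYCLES
    /\ \A node \in DOMAIN n :
        /\ n[node] <= SIMULATED_CYCLES
        /\ Abs(n[node] - g) <= BOUNDED_DRIFT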
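The bounded-drift clock model can also be explored outside TLC. The following is a rough Python sketch, not the author's code: it mirrors ClockStep and ClockDriftInBounds, ignores system entirely, and searches breadth-first from Init. The constant values (8 cycles, drift of 1) and the max_pairwise_skew helper are assumptions chosen for illustration.

```python
from collections import deque
from itertools import combinations

NODES = ("A", "B", "C")


def clock_drift_in_bounds(g, clocks, simulated_cycles, bounded_drift):
    """Python rendering of ClockDriftInBounds(g, n)."""
    return g <= simulated_cycles and all(
        c <= simulated_cycles and abs(c - g) <= bounded_drift for c in clocks
    )


def clock_steps(state, simulated_cycles, bounded_drift):
    """Yield every successor state allowed by ClockStep."""
    g, clocks = state
    # Tick the global clock if every node stays in bounds of the new value.
    if clock_drift_in_bounds(g + 1, clocks, simulated_cycles, bounded_drift):
        yield (g + 1, clocks)
    # Tick one node clock if it stays in bounds of the global clock.
    for i in range(len(clocks)):
        ticked = clocks[:i] + (clocks[i] + 1,) + clocks[i + 1:]
        if clock_drift_in_bounds(g, ticked, simulated_cycles, bounded_drift):
            yield (g, ticked)


def reachable(simulated_cycles, bounded_drift):
    """Breadth-first search from Init over ClockStep transitions only."""
    start = (0, (0,) * len(NODES))
    seen = {start}
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        for nxt in clock_steps(state, simulated_cycles, bounded_drift):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def max_pairwise_skew(states):
    """Largest disagreement between two node clocks in any reachable state."""
    return max(
        abs(a - b) for _, clocks in states for a, b in combinations(clocks, 2)
    )


if __name__ == "__main__":
    states = reachable(simulated_cycles=8, bounded_drift=1)
    print("reachable clock states:", len(states))
    print("max skew between two nodes:", max_pairwise_skew(states))
```

Because each node is bounded only relative to the global clock, two nodes can disagree with each other by up to 2 * BOUNDED_DRIFT. Also note that with a bound of 0 no tick is ever enabled, since clocks advance one at a time; the bound has to be at least 1 for the model to move.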

slide-47
SLIDE 47

This works better

  • Narrower state space
  • Directly addresses relevant failure domain
slide-48
SLIDE 48

The system was more vulnerable to drift than previously thought

slide-49
SLIDE 49

Delivering a Model

  • Literate PDF
  • Makefile / .cfg file
  • Config Instructions
slide-50
SLIDE 50

TLA+ is tricky to use this way

  • Difficult setup
  • Easier development
  • Easier delivery
slide-51
SLIDE 51

Give models to your customers

slide-52
SLIDE 52

Extending the technique

  • Asymmetric Drift
  • Action on Tick
  • Cyclical Clock
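The slide gives only the names of these extensions, so the following Python sketch is one possible reading rather than the author's design. It assumes "asymmetric drift" means separate limits for running behind and running ahead of the reference clock, and "cyclical clock" means counters that wrap modulo a fixed period. All function and parameter names are hypothetical; "action on tick" is not sketched.

```python
def asymmetric_in_bounds(g, c, max_behind, max_ahead):
    """Node clock c may lag reference g by at most max_behind ticks
    and lead it by at most max_ahead ticks (assumed interpretation)."""
    return -max_behind <= c - g <= max_ahead


def cyclic_offset(c, g, period):
    """Signed offset of c from g on a clock that wraps modulo period.

    The result lies in [-(period // 2), period - period // 2 - 1], so a
    small lead or lag across the wrap point is measured as small.
    """
    half = period // 2
    return (c - g + half) % period - half


def cyclic_in_bounds(g, c, period, bounded_drift):
    """Drift check for wrapping clocks; only meaningful while
    bounded_drift is smaller than period // 2."""
    return abs(cyclic_offset(c, g, period)) <= bounded_drift


if __name__ == "__main__":
    # A node that has wrapped to 0 while the reference is still at 7
    # is one tick ahead, not seven ticks behind.
    print(cyclic_offset(0, 7, 8))
    print(cyclic_in_bounds(7, 0, 8, 1))
    print(asymmetric_in_bounds(10, 12, max_behind=1, max_ahead=2))
```

Either predicate could stand in for the symmetric Abs check in a drift-bounds operator, at the cost of one or two extra constants.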
slide-57
SLIDE 57

Closing Thoughts

  • Fake a real clock
  • Bound the drift
  • Give models to your customers
  • I owe Hillel Wayne a great debt
slide-58
SLIDE 58

Russell Mull
@mullr
russell@auxon.io