SLIDE 1

Defending Distributed Cyber-Physical Systems with Bounded Time Recovery

Brian Sandler, Neeraj Gandhi, Linh Thi Xuan Phan, Andreas Haeberlen
NSF/Intel CPS PI Meeting, July 2018

SLIDE 2

Machines in Control

Vulnerable CPS can cause disaster.

Explosion
Equipment damage
Power outages
…

BTR - NSF/Intel PI Meeting - July 2018

Bellingham, WA

Oil pipeline explosion after the two controlling computers failed.

We want to prevent disaster.

Iran

The Stuxnet worm destroyed centrifuges used for uranium enrichment.

Ivano-Frankivsk, Ukraine

Power grid control systems were compromised, leaving residents in the dark.

SLIDE 3

Goal: General Defense

Crashes

Byzantine Faults

Non-Crash Bugs

Hacking

SLIDE 4

Example: Industrial Automation

Let’s take a simple example system…

[Figure: example system with nodes N1-N4 and attached devices S1, S2, and A1-A4.]

SLIDE 5

Example: Industrial Automation

This system will run four applications.

[Figure: the example system (nodes N1-N4, devices S1, S2, A1-A4) with elements numbered 1-8 overlaid.]

SLIDE 6

Example: Industrial Automation

We’ll focus on the burner control application…

[Figure: the example system (nodes N1-N4, devices S1, S2, A1-A4) with elements numbered 1-8 overlaid.]

SLIDE 7

Example: Impact of Failures

[Figure: the example system (nodes N1-N4, devices S1, S2, A1-A4) with elements numbered 1-8 overlaid.]

What can go wrong?

N4 can send an incorrect value to A1 and light the building on fire.
N4 can drop or delay messages and ruin the chemical processing.

SLIDE 8

State of the Art: Byzantine Fault Tolerance

Benefits

Adversarial Scenarios
Strong Guarantees
Nice Programming Model

SLIDE 9

Is continuous perfection required?

How bad is it if the adversary gains control?

Many CPS have properties that resist quick changes:

inertia
thermal capacity

We don’t have to always be perfect.

We can leverage this!

[Figure: node N4 and a chemical vat.]

SLIDE 10

For how long is faulty behavior okay?

Different applications have different tolerances.

A time period usually exists where faulty behavior is OK, so long as the system returns to its correct behavior within that period.

Application                           Time
DC/DC converters (STM)                20 μs
Direct torque control (ABB)           25 μs
AC/DC converters                      50 μs
Electronic throttle control (Ford)    5 ms
Traction control (Ford)               20 ms
Micro-scale race cars                 40 ms
Autonomous vehicle steering           50 ms
Energy-efficient building control     500 ms

Source: M. Morari. Fast model predictive control (mpc).
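
As a rough illustration of how these figures could be used, the sketch below checks a candidate recovery bound against the times listed above. It reads each listed time as the longest period of faulty behavior the application can absorb, which follows the slide's framing but is an assumption here; the function names are hypothetical.

```python
# Times copied from the list above, converted to seconds.
# Assumption: each time is treated as the application's tolerance
# for faulty behavior.
TOLERANCE_S = {
    "DC/DC converters (STM)": 20e-6,
    "Direct torque control (ABB)": 25e-6,
    "AC/DC converters": 50e-6,
    "Electronic throttle control (Ford)": 5e-3,
    "Traction control (Ford)": 20e-3,
    "Micro-scale race cars": 40e-3,
    "Autonomous vehicle steering": 50e-3,
    "Energy-efficient building control": 500e-3,
}


def tolerates(application: str, recovery_bound_s: float) -> bool:
    """True if recovery within `recovery_bound_s` fits the tolerance."""
    return recovery_bound_s <= TOLERANCE_S[application]


def applications_covered(recovery_bound_s: float) -> list[str]:
    """Applications whose tolerance covers the given recovery bound."""
    return [name for name, tol in TOLERANCE_S.items()
            if recovery_bound_s <= tol]


if __name__ == "__main__":
    # A hypothetical 10 ms recovery bound.
    print(applications_covered(10e-3))
```

Under this reading, a system that can only bound recovery at tens of milliseconds would suit the slower applications in the list but not the power-electronics ones.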

SLIDE 11

Approach: Bounded Time Recovery

BTR guarantees that the system recovers from any fault within a short period of time, so that the end goal will be met.

Weaker guarantee is often sufficient

[Figure: timeline showing Correct Operation, Fault, Recovery Period, Recovered, Correct Operation.]

SLIDE 12

So, how do we make this happen?

REBOUND

SLIDE 13

REBOUND

1. Planning
Before the system is compromised, think about what it should do.
The system operates in a different mode for each set of faults.
Less critical tasks can be dropped as necessary.

[Figure: when N2 fails, new schedules for N1, N3, and N4.]

SLIDE 14

REBOUND

2. Detection

Nodes watch over each other to detect faults.

[Figure: the example system, with logged SEND/RECV records producing the evidence "N4 is faulty".]

SLIDE 15

REBOUND

3. Consistency

Flood evidence throughout the system.

[Figure: the evidence "N4 is faulty" spreading through the example system.]

SLIDE 16

REBOUND

4. Adaptation

Each node independently transitions to a new mode.

[Figure: the nodes' status labels changing from "All nodes OK" to "N4 is faulty".]

SLIDE 17

Outline

Problem Introduction
Bounded Time Recovery
REBOUND
Technical Components
1. Planning
2. Detection
3. Consistency
4. Adaptation
Results

SLIDE 18

1. Planning

For every* mode, we have a precomputed schedule and plan for every node.

Schedule generated offline
When tasks should run and where
Many constraints
Dependent scheduling problem
Builds a tree

* Can limit the number of faults to improve computation time.
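
To make the idea of a precomputed plan per mode concrete, here is a minimal sketch. It enumerates every fault set up to a limit and assigns tasks greedily by criticality, dropping whatever does not fit. The greedy placement and the `capacity` parameter are illustrative assumptions; the slides describe a constrained scheduling problem, which this does not attempt to solve.

```python
from itertools import combinations


def enumerate_modes(nodes, max_faults):
    """Every set of faulty nodes of size 0..max_faults, as frozensets."""
    modes = []
    for k in range(max_faults + 1):
        for faulty in combinations(sorted(nodes), k):
            modes.append(frozenset(faulty))
    return modes


def plan_for_mode(nodes, tasks, faulty, capacity=2):
    """Place tasks on healthy nodes, most critical first.

    tasks: {task_name: criticality}; a higher number is more critical.
    capacity: hypothetical limit on tasks per node.
    Tasks that do not fit are left out, i.e. dropped in this mode.
    """
    healthy = [n for n in sorted(nodes) if n not in faulty]
    load = {n: 0 for n in healthy}
    placement = {}
    for name, _crit in sorted(tasks.items(), key=lambda t: -t[1]):
        candidates = [n for n in healthy if load[n] < capacity]
        if not candidates:
            continue  # no room left: this less critical task is dropped
        target = min(candidates, key=lambda n: (load[n], n))
        placement[name] = target
        load[target] += 1
    return placement


def build_plan_table(nodes, tasks, max_faults, capacity=2):
    """Offline step: one placement per mode, keyed by the fault set."""
    return {mode: plan_for_mode(nodes, tasks, mode, capacity)
            for mode in enumerate_modes(nodes, max_faults)}
```

With four nodes, limiting the plan to one fault gives 5 modes and limiting it to two gives 11, which is why capping the number of faults keeps the offline computation manageable.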

[Figure: tree of modes, with nodes such as "No Faults", "Node 1 Faulty", "Link 1-2 Faulty", and "Nodes 1&4 Faulty".]

SLIDE 19

2. Detection

Omission Faults

Declare link faulty if an expected message from a neighbor is not received.
Declaration causes other nodes to change mode.
Leverage synchrony.
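
The omission rule can be sketched as a deadline check: because the schedule is known in advance and clocks are assumed synchronized, a node knows when each message should arrive. The function below is a minimal illustration; the `slack` parameter and the data layout are assumptions, not details from the talk.

```python
def detect_omissions(expected, received, now, slack):
    """Return the links to declare faulty.

    expected: {(sender, receiver): deadline} for messages the schedule
              says must arrive.
    received: set of (sender, receiver) pairs that did arrive.
    slack:    allowance for clock skew and network delay (hypothetical).
    A link is declared faulty once its deadline plus slack has passed
    without the expected message arriving.
    """
    return {link for link, deadline in expected.items()
            if link not in received and now > deadline + slack}
```

Since the deadline and slack are both known ahead of time, the delay between an omission and its declaration is bounded by construction in this sketch.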

Commission Faults

Witness/Audit Nodes and Replicas
If fault found, log is used as a proof of misbehavior.

Large improvement over PeerReview
Adding synchrony
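
One way to picture the witness/audit idea is replaying a node's log through a reference copy of its task and flagging the first output a correct node would not have produced. The sketch below assumes a deterministic task that maps the latest received value to the next sent value, and it omits the signatures that would make a log usable as proof; both simplifications are assumptions for illustration.

```python
def audit(log, task):
    """Replay a node's log through a reference copy of its task.

    log:  ordered list of ("RECV", value) and ("SEND", value) entries.
    task: deterministic function from the latest received value to the
          value that should be sent next (assumed model of a task).
    Returns None if the log is consistent, otherwise the index of the
    first SEND entry that a correct node would not have produced.
    """
    last_input = None
    for i, (kind, value) in enumerate(log):
        if kind == "RECV":
            last_input = value
        elif kind == "SEND":
            if last_input is None or task(last_input) != value:
                return i
    return None
```

The returned index points at the offending log entry, which is the part of the log a witness would present as evidence.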

Challenge: Bounding Time of Detection

[Figure: an audit/witness task runs a replica and checks SEND/RECV logs; a node declares one of N1's links to be faulty.]

SLIDE 20

3. Consistency

We need a solution where…

Any two good nodes agree on the state of the system,
or
the two become aware they cannot communicate.

Strawman: flood the system periodically with signed attestations of current mode.

Actual solution is more efficient.
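
The strawman can be sketched as follows. Each node signs an attestation of its current mode and neighbors forward any attestation that verifies, until nothing new is learned. HMAC with per-node keys stands in for real digital signatures here, which is a simplification: with HMAC the verifier must hold the signer's key, whereas a deployed system would use public-key signatures. All names are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, node: str, mode: str, round_no: int) -> bytes:
    """Stand-in signature over (node, mode, round)."""
    msg = f"{node}|{mode}|{round_no}".encode()
    return hmac.new(key, msg, hashlib.sha256).digest()


def verify(key: bytes, node: str, mode: str, round_no: int,
           tag: bytes) -> bool:
    return hmac.compare_digest(sign(key, node, mode, round_no), tag)


def flood(links, keys, modes, round_no):
    """Flood one attestation per node until no node learns anything new.

    links: {node: set of neighbours}
    keys:  {node: signing key}
    modes: {node: current mode}
    Returns {node: {origin: mode}}, the attestations each node accepted.
    """
    known = {n: {n: (modes[n], sign(keys[n], n, modes[n], round_no))}
             for n in links}
    changed = True
    while changed:
        changed = False
        for sender in links:
            for receiver in links[sender]:
                for origin, (mode, tag) in list(known[sender].items()):
                    if origin in known[receiver]:
                        continue
                    # Forwarded attestations are checked before acceptance.
                    if verify(keys[origin], origin, mode, round_no, tag):
                        known[receiver][origin] = (mode, tag)
                        changed = True
    return {n: {o: m for o, (m, _tag) in view.items()}
            for n, view in known.items()}
```

On a connected network every node ends up with every other node's attestation, so two good nodes can compare modes; a node that hears nothing from a peer learns that they cannot communicate.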

SLIDE 21

4. Adaptation
Each node individually transitions when its mode changes.
When evidence is received, a mode change occurs within a bounded period of time.
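
A minimal sketch of the per-node transition, assuming each node carries a precomputed table from fault sets to schedules: on new evidence the node extends its fault set and looks up the matching schedule. The `Node` class and its fields are hypothetical, and the sketch does not model the timing bound itself.

```python
class Node:
    """A node that switches schedule when it learns of a new fault."""

    def __init__(self, name, plans):
        self.name = name
        self.plans = plans            # {frozenset(faulty): schedule}
        self.faulty = frozenset()     # start in the fault-free mode
        self.schedule = plans[self.faulty]

    def on_evidence(self, faulty_component):
        """Apply one piece of evidence; return True if the mode changed."""
        new_faulty = self.faulty | {faulty_component}
        if new_faulty == self.faulty:
            return False              # already known, nothing to do
        if new_faulty not in self.plans:
            raise KeyError(f"no precomputed plan for {sorted(new_faulty)}")
        self.faulty = new_faulty
        self.schedule = self.plans[new_faulty]
        return True
```

Because the mode is determined by the set of known faults rather than by the order in which evidence arrives, nodes that receive the same evidence in a different order still end up in the same mode.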

[Figure: per-node schedules as N2, then N1, then N4 fail, passing through the modes "N2 Faulty", "N1 & N2 Faulty", and "N1,N2,N4 Faulty".]

SLIDE 22

Challenges

Bounding every step of the algorithms
Overhead of periodic flood
Multisignatures drastically reduce traffic
Handling equivocation
Different nodes notifying of different faults to their neighbors

Proving everything
Correctness
Completeness
Bounded detection
Bounded stabilization
Planning
Unique problem

SLIDE 23

Outline

Problem Introduction
Bounded Time Recovery
REBOUND
Technical Components
1. Planning
2. Detection
3. Consistency
4. Adaptation
Results

SLIDE 24

Overhead of Schedule Tree

Time depends on:
The number of nodes.
Degree of network.
Number of faulty nodes, f.

Only compute once for the lifetime of the system.

Subtrees easily parallelizable.

f = # of faulty nodes protected against

SLIDE 25

Recovery

Unprotected System, N2 Compromised

SLIDE 26

Recovery

Protected System, N2 Compromised

Recovery Period

SLIDE 27

Recovery

Protected System, N1, N2, N3 Compromised

SLIDE 28

Thank you.

Key Idea: Period of Imperfection

Many CPS can tolerate a short period of faulty behavior.

Approach: Bounded Time Recovery

Bounded time recovery guarantees that the system quickly returns to correct behavior after a fault.

Solution: REBOUND

Algorithms and protocols to provide BTR for distributed systems.
