Defending Distributed Cyber-Physical Systems with Bounded Time Recovery




SLIDE 1

Defending Distributed Cyber-Physical Systems with Bounded Time Recovery

Brian Sandler, Neeraj Gandhi, Linh Thi Xuan Phan, Andreas Haeberlen

NSF/Intel CPS PI Meeting, July 2018

SLIDE 2

Machines in Control

  • Vulnerable CPS can cause disaster.

  • Explosion
  • Equipment damage
  • Power outages

BTR - NSF/Intel PI Meeting - July 2018


Bellingham, WA

Oil pipeline explosion after the two controlling computers failed.

We want to prevent disaster.

Iran

The Stuxnet worm destroyed centrifuges used for uranium enrichment.

Ivano-Frankivsk, Ukraine

The systems controlling the power grid were compromised, leaving residents in the dark.

SLIDE 3

Goal: General Defense


Crashes

Byzantine Faults

Non-Crash Bugs

Hacking

SLIDE 4

Example: Industrial Automation


Let’s take a simple example system…

[Figure: example system diagram with elements N1-N4, S1-S2, and A1-A4]

SLIDE 5

Example: Industrial Automation


This system will run four applications.

[Figure: the same system diagram (N1-N4, S1-S2, A1-A4), now annotated with the labels 1-8]

SLIDE 6

Example: Industrial Automation


We’ll focus on the burner control application…

[Figure: the same system diagram (N1-N4, S1-S2, A1-A4) with the labels 1-8]

SLIDE 7

Example: Impact of Failures

[Figure: the same system diagram (N1-N4, S1-S2, A1-A4) with the labels 1-8]

What can go wrong?

  • N4 can send an incorrect value to A1 and light the building on fire.
  • N4 can drop or delay messages and ruin the chemical processing.

SLIDE 8

State of the Art: Byzantine Fault Tolerance

Benefits

  • Adversarial Scenarios
  • Strong Guarantees
  • Nice Programming Model
SLIDE 9

Is continuous perfection required?

  • How bad is it if the adversary gains control?
  • Many CPS have properties that resist quick changes
      • inertia
      • thermal capacity
  • We don’t have to always be perfect


We can leverage this!

[Figure: N4 and a chemical vat]

SLIDE 10

For how long is faulty behavior okay?

  • Different applications have different tolerances.


A time period usually exists where faulty behavior is OK so long as the system returns to its correct behavior within that period.

  • DC/DC converters (STM): 20 μs
  • Direct torque control (ABB): 25 μs
  • AC/DC converters: 50 μs
  • Electronic throttle control (Ford): 5 ms
  • Traction control (Ford): 20 ms
  • Micro-scale race cars: 40 ms
  • Autonomous vehicle steering: 50 ms
  • Energy-efficient building control: 500 ms

Source: M. Morari. Fast model predictive control (MPC).
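As a rough illustration of how these figures might be used, the sketch below lists which applications could live with a given worst-case recovery time. Treating each figure as the longest tolerable period of faulty behavior is an assumption made for this example, and the names `TOLERANCE_S` and `tolerates` are hypothetical.

```python
# Time figures from the slide, converted to seconds.
# Assumption: each figure is read as the longest tolerable faulty period.
TOLERANCE_S = {
    "DC/DC converters (STM)": 20e-6,
    "Direct torque control (ABB)": 25e-6,
    "AC/DC converters": 50e-6,
    "Electronic throttle control (Ford)": 5e-3,
    "Traction control (Ford)": 20e-3,
    "Micro-scale race cars": 40e-3,
    "Autonomous vehicle steering": 50e-3,
    "Energy-efficient building control": 500e-3,
}


def tolerates(recovery_bound_s: float) -> list[str]:
    """Return the applications whose tolerance is at least the recovery bound."""
    return [name for name, tol in TOLERANCE_S.items() if tol >= recovery_bound_s]


# Which applications could accept a 30 ms recovery bound?
print(tolerates(0.030))
```

Under these assumptions, a 30 ms bound would suit only the three slowest applications in the list; the faster control loops would need a much tighter bound.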

SLIDE 11

Approach: Bounded Time Recovery

  • BTR guarantees that the system recovers from any fault within a short period of time, so that the end goal will be met

  • Weaker guarantee is often sufficient


[Figure: timeline showing correct operation, a fault, a recovery period ending when the system has recovered, and correct operation again]
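One way to make the guarantee concrete is to check a recorded trace against a recovery bound. The sketch below is only an illustration under assumed inputs: the trace format (timestamped correct/faulty samples) and the function name `satisfies_btr` are hypothetical and not taken from the talk.

```python
def satisfies_btr(trace, bound):
    """Check a trace against a recovery bound.

    trace: list of (timestamp, is_correct) samples, sorted by timestamp.
    bound: longest allowed time between the first faulty sample of an
           episode and the next correct sample.
    Returns False only if a violation is observed in the trace.
    """
    fault_start = None
    for t, ok in trace:
        if ok:
            # Recovery happened; it must not be later than the bound.
            if fault_start is not None and t - fault_start > bound:
                return False
            fault_start = None
        elif fault_start is None:
            fault_start = t  # a new faulty episode begins
        elif t - fault_start > bound:
            return False  # still faulty after the bound has elapsed
    return True


trace = [(0, True), (1, False), (2, False), (3, True), (4, True)]
print(satisfies_btr(trace, bound=2))  # recovery takes 2 time units
print(satisfies_btr(trace, bound=1))
```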

SLIDE 12

So, how do we make this happen?

REBOUND

SLIDE 13

REBOUND

  • 1. Planning
      • Before the system is compromised, think about what it should do.
      • The system operates in a different mode for any given set of faults.
      • Can drop less critical tasks as necessary.


[Figure: the network N1-N4 before and after N2 fails, with per-node entries for N1, N3, and N4]
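The idea of dropping less critical tasks in a degraded mode can be sketched as follows. This is a simplified greedy placement written for illustration, not the planner used by REBOUND; the task names other than the burner control application, the criticality values, the loads, and the per-node capacity are all hypothetical.

```python
def plan_for_mode(tasks, nodes, faulty, capacity):
    """Place tasks on surviving nodes, most critical first.

    tasks: list of (name, criticality, load); higher criticality wins.
    Returns (assignment, dropped), where assignment maps task -> node.
    """
    free = {n: capacity for n in nodes if n not in faulty}
    assignment, dropped = {}, []
    for name, _crit, load in sorted(tasks, key=lambda t: -t[1]):
        # Pick the surviving node with the most free capacity, if any.
        target = max(free, key=free.get, default=None)
        if target is not None and free[target] >= load:
            assignment[name] = target
            free[target] -= load
        else:
            dropped.append(name)  # no room left: the task is dropped
    return assignment, dropped


tasks = [("burner control", 3, 2), ("vat control", 2, 2), ("logging", 1, 2)]
nodes = ["N1", "N2", "N3", "N4"]
print(plan_for_mode(tasks, nodes, faulty={"N2"}, capacity=2))
print(plan_for_mode(tasks, nodes, faulty={"N2", "N3"}, capacity=2))
```

With one faulty node everything still fits; with two, the least critical task is the one that gets dropped.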

SLIDE 14

Evidence

N4 is faulty.

REBOUND

  • 2. Detection

Nodes watch over each other to detect faults.

[Figure: system diagram with SEND/RECV logs exchanged between nodes; N4 is identified as faulty]

SLIDE 15

REBOUND

  • 3. Consistency

Flood evidence throughout the system.

[Figure: system diagram with the evidence "N4 is faulty" spreading through the system]

SLIDE 16

REBOUND

[Figure: the same system diagram (N1-N4, S1-S2, A1-A4) with updated labels]

  • 4. Adaptation

Each node independently transitions to a new mode

[Figure: per-node status labels changing from "All nodes OK" to "N4 is faulty"]

SLIDE 17

Outline

  • Problem Introduction
  • Bounded Time Recovery
  • REBOUND
  • Technical Components
      • 1. Planning
      • 2. Detection
      • 3. Consistency
      • 4. Adaptation
  • Results

SLIDE 18
  • 1. Planning

For every* mode, we have a precomputed schedule and plan for every node.

  • Schedule generated offline
  • When tasks should run and where
  • Many constraints
  • Dependent scheduling problem
  • Builds a tree

* Can limit the number of faults to improve computation time.


[Figure: tree of modes containing "No Faults", "Node 1 Faulty", "Link 1-2 Faulty", "Nodes 1&4 Faulty", …]
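The structure of such a precomputed plan can be sketched by enumerating fault sets up to a limit and linking each one to the fault sets reachable by one more fault. This is an illustrative simplification: it considers node faults only, it merges fault sets reached in different orders (so the result is strictly a graph rather than a tree), and `build_mode_tree` is a hypothetical name. A real plan would also attach a schedule to each mode.

```python
from itertools import combinations


def build_mode_tree(components, max_faults):
    """Map each fault set to the fault sets reachable by one more fault."""
    tree = {}
    for k in range(max_faults + 1):
        for faults in combinations(components, k):
            mode = frozenset(faults)
            if k < max_faults:
                # Children: the same mode plus one additional faulty component.
                tree[mode] = [mode | {c} for c in components if c not in mode]
            else:
                tree[mode] = []  # fault limit reached: no further children
    return tree


tree = build_mode_tree(["N1", "N2", "N3", "N4"], max_faults=2)
print(len(tree))               # number of modes with at most two faulty nodes
print(len(tree[frozenset()]))  # modes directly below "No Faults"
```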

SLIDE 19
  • 2. Detection

Omission Faults

  • Declare a link faulty if an expected message from a neighbor is not received
  • The declaration causes other nodes to change mode.
  • Leverage synchrony.
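Because the system is synchronous, each expected message can be given a deadline, and a missing message becomes detectable once that deadline passes. The sketch below illustrates that idea only; the data layout and the name `detect_omissions` are assumptions made for this example.

```python
def detect_omissions(expected, received, now):
    """Return neighbors whose link is declared faulty at time `now`.

    expected: {(sender, seq): deadline} for messages the schedule promises.
    received: set of (sender, seq) pairs that actually arrived.
    """
    return sorted({
        sender
        for (sender, seq), deadline in expected.items()
        if now > deadline and (sender, seq) not in received
    })


expected = {("N1", 1): 10, ("N2", 1): 10, ("N2", 2): 20}
received = {("N1", 1)}
print(detect_omissions(expected, received, now=5))   # no deadline has passed yet
print(detect_omissions(expected, received, now=15))  # N2's first message is overdue
```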

Commission Faults

  • Witness/Audit Nodes and Replicas
  • If a fault is found, the log is used as proof of misbehavior.
  • Large improvement over PeerReview
      • Adding synchrony
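The auditing idea can be sketched as a witness that replays a node's logged inputs through its own replica of the task and compares the results. This is a minimal illustration assuming a deterministic task; it omits the tamper-evident logging and signatures a real system would need, and the names `audit` and `replica` are hypothetical.

```python
def audit(log, replica):
    """Replay logged inputs through a replica and compare outputs.

    log: list of (input_value, claimed_output) entries from the audited node.
    replica: deterministic reference implementation of the audited task.
    Returns (index, input, claimed_output) for the first entry the replica
    cannot reproduce, or None if the whole log checks out.
    """
    for index, (inp, claimed) in enumerate(log):
        if replica(inp) != claimed:
            return index, inp, claimed
    return None


def double(x):
    # Hypothetical task: the correct output is twice the input.
    return 2 * x


print(audit([(1, 2), (2, 4), (3, 6)], double))  # consistent log
print(audit([(1, 2), (2, 4), (3, 7)], double))  # third entry is wrong
```

The returned entry plays the role of evidence: any other node holding the same replica can recheck it independently.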

Challenge: Bounding Time of Detection

[Figure: N1 and N2 with SEND/RECV logs; an audit/witness task runs a replica, and a node declares a link faulty after a missing message]

SLIDE 20
  • 3. Consistency

We need a solution where…

  • Any two good nodes agree on the state of the system, or
  • The two become aware they cannot communicate

Strawman: flood the system periodically with signed attestations of the current mode

  • Actual solution is more efficient
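The strawman can be sketched as a breadth-first flood of an attestation that is checked before it is forwarded. This is an illustration only: HMAC with per-node keys stands in for real digital signatures, and the keys, the topology in `links`, and all function names are hypothetical.

```python
import hashlib
import hmac
from collections import deque

# Hypothetical keys; HMAC is a stand-in for real digital signatures.
KEYS = {"N1": b"key-1", "N2": b"key-2", "N3": b"key-3", "N4": b"key-4"}


def attest(node, mode):
    """Create an attestation (node, mode, tag) of the node's current mode."""
    msg = f"{node}:{mode}".encode()
    return node, mode, hmac.new(KEYS[node], msg, hashlib.sha256).hexdigest()


def verify(attestation):
    """Check that the tag matches the claimed node and mode."""
    node, mode, tag = attestation
    msg = f"{node}:{mode}".encode()
    expected = hmac.new(KEYS[node], msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)


def flood(links, origin, attestation):
    """Return the set of nodes that accept the attestation."""
    if not verify(attestation):
        return set()  # invalid attestations are never forwarded
    seen, queue = {origin}, deque([origin])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


links = {"N1": ["N2", "N3"], "N2": ["N1", "N4"], "N3": ["N1"], "N4": ["N2"]}
print(sorted(flood(links, "N1", attest("N1", "N4 is faulty"))))
```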

SLIDE 21
  • 4. Adaptation
      • Each node individually transitions when its mode changes.
      • When evidence is received, a mode change occurs within a bounded period of time.

[Figure: successive modes as N2, then N1, then N4 fail ("N2 Faulty", "N1 & N2 Faulty", "N1,N2,N4 Faulty"), with per-node entries for the remaining nodes]
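Since the plan is computed ahead of time, a mode change at run time can reduce to a table lookup, which helps keep its duration bounded. The sketch below illustrates this under assumed data: the `Node` class, the plan table, and the task names in it are hypothetical.

```python
class Node:
    """A node that switches schedules by looking up its precomputed plan."""

    def __init__(self, name, plans):
        self.name = name
        self.plans = plans         # {frozenset of faulty nodes: schedule}
        self.faults = frozenset()  # start in the fault-free mode

    @property
    def schedule(self):
        return self.plans[self.faults]

    def on_evidence(self, faulty_node):
        """Record new evidence and switch to the matching mode.

        Raises KeyError if the plan has no entry for the new fault set.
        """
        self.faults = self.faults | {faulty_node}
        return self.schedule


# Hypothetical plan for N3: it takes over more tasks as other nodes fail.
plans = {
    frozenset(): ["task 3"],
    frozenset({"N2"}): ["task 3", "task 2"],
    frozenset({"N1", "N2"}): ["task 3", "task 2", "task 1"],
}
n3 = Node("N3", plans)
print(n3.schedule)
print(n3.on_evidence("N2"))
print(n3.on_evidence("N1"))
```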

SLIDE 22

Challenges

  • Bounding every step of the algorithms
  • Overhead of periodic flood
      • Multisignatures drastically reduce traffic
  • Handling equivocation
      • Different nodes notifying their neighbors of different faults
  • Proving everything
      • Correctness
      • Completeness
      • Bounded detection
      • Bounded stabilization
  • Planning
      • Unique problem

SLIDE 23

Outline

  • Problem Introduction
  • Bounded Time Recovery
  • REBOUND
  • Technical Components
      • 1. Planning
      • 2. Detection
      • 3. Consistency
      • 4. Adaptation
  • Results

SLIDE 24

Overhead of Schedule Tree


  • Time depends on:
      • The number of nodes.
      • Degree of network.
      • Number of faulty nodes, f.
  • Only compute once for the lifetime of the system.
  • Subtrees easily parallelizable.

f = # of faulty nodes protected against
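To get a feel for why the number of nodes and f both drive planning time, one can count the distinct sets of at most f faulty nodes among n nodes. This is only a back-of-the-envelope sketch: it ignores link faults and network degree, and `mode_count` is a hypothetical name.

```python
from math import comb


def mode_count(n, f):
    """Number of distinct fault sets with at most f faulty nodes out of n."""
    return sum(comb(n, k) for k in range(f + 1))


# Growth in the number of modes for an 8-node system as f increases.
for f in range(4):
    print(f, mode_count(8, f))
```

Under this simplification the count grows quickly with f, which is consistent with limiting the number of faults to keep the offline computation manageable.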

SLIDE 25

Recovery


Unprotected System, N2 Compromised

SLIDE 26

Recovery


Protected System, N2 Compromised

Recovery Period

SLIDE 27

Recovery


Protected System, N1, N2, N3 Compromised

SLIDE 28


Thank you.

Key Idea: Period of Imperfection

Many CPS can tolerate a short period of faulty behavior.

Approach: Bounded Time Recovery

Bounded time recovery guarantees that the system quickly returns to correct behavior after a fault.

Solution: REBOUND

Algorithms and protocols to provide BTR for distributed systems.