Evaluating BFT Protocols for Spire, by Henry Schuh & Sam Beckley




slide-1
SLIDE 1

Evaluating BFT Protocols for Spire

Henry Schuh & Sam Beckley

600.667 Advanced Distributed Systems & Networks

slide-2
SLIDE 2
  • SCADA & Spire Overview
  • High-Performance, Scalable Spire
  • Trusted Platform Module
  • Known Network Characteristics
  • Evaluating BFT-SMART
  • Benchmarking Results
  • Conclusions
slide-3
SLIDE 3

Power Grid Overview

slide-4
SLIDE 4

SCADA Overview

slide-5
SLIDE 5

SCADA Requirements

  • Must have very low latencies (100-200ms)

  • Must have very high reliability
  • Must be able to run for decades
slide-6
SLIDE 6

SCADA Adopting IP & Internet

  • In the past, SCADA used proprietary protocols on air-gapped systems
  • Now moving to both IP & the Internet to reduce costs

slide-7
SLIDE 7

“These devices were not only internet facing, they did not have security mechanisms to prevent unauthorized access”

Source: Trend Micro Incorporated, Who’s Really Attacking Your ICS Systems
slide-8
SLIDE 8

Attacks on SCADA Systems

28 Days: 39 Attacks
 All targeted specifically at SCADA systems
 
 The first attack was within 18 hours of the honeypot going live

Source: Trend Micro Incorporated, Who’s Really Attacking Your ICS Systems

slide-9
SLIDE 9

Distributed Replication

  • Several machines that coordinate their actions such that they appear to be a single unified machine to a client.

Pros: High Availability and Performance
Cons: Cost of Synchronization

slide-10
SLIDE 10

Intrusion Tolerant Replication

Somewhat Formally: The ability to make progress in the presence of some number of malicious replicas with guaranteed correctness. Some protocols also guarantee a level of performance under attack.

Informally: If some of the replicas get hacked, the system still works.

slide-11
SLIDE 11

Defense Across Space & Time

Defense Across Time: Have to periodically regain control of a compromised machine to stop the attacker from eventually gaining control of the entire network.

Defense Across Space: Every replica must present a unique attack surface so that one attack cannot be used to compromise every replica.

slide-12
SLIDE 12
  • SCADA & Spire Overview
  • High-Performance, Scalable Spire
  • Trusted Platform Module
  • Known Network Characteristics
  • Evaluating BFT-SMART
  • Benchmarking Results
  • Conclusions
slide-13
SLIDE 13

Spire

Open-source SCADA system that provides both standard cryptographic defense mechanisms and an intrusion-tolerant SCADA Master.

Spire uses several different technologies:

  • Prime
  • Spines
  • PVBrowser
slide-14
SLIDE 14

Spire

[Diagram: Spire architecture. Components shown: pvbrowser HMI; four SCADA Master replicas, each paired with Prime; two RTU / PLC proxies (one for an RTU, one for a PLC); the External Spines Network and the Internal Spines Network.]

slide-15
SLIDE 15

Scaling Spire

  • In order to tolerate more intrusions, we need more replicas
  • The more replicas, the higher the latency becomes
  • We rely on having very low latency
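The trade-off on this slide can be made concrete with standard replica-count arithmetic. The sketch below uses the classic BFT bound n = 3f + 1 and, as an assumption that is not stated in these slides, the n = 3f + 2k + 1 form often used when k replicas may be undergoing proactive recovery at the same time. The function names are illustrative only.

```python
def min_replicas(f, k=0):
    """Replicas needed to tolerate f intrusions while k replicas recover.

    k = 0 gives the classic BFT bound 3f + 1. The 2k term is an assumption
    borrowed from proactive-recovery designs, not taken from these slides.
    """
    if f < 0 or k < 0:
        raise ValueError("f and k must be non-negative")
    return 3 * f + 2 * k + 1


def quorum_size(f, k=0):
    """Quorum size under the same assumption: 2f + k + 1."""
    if f < 0 or k < 0:
        raise ValueError("f and k must be non-negative")
    return 2 * f + k + 1


if __name__ == "__main__":
    # Each extra tolerated intrusion costs three more replicas.
    for f in range(1, 4):
        print(f, min_replicas(f), min_replicas(f, k=1))
```

With f = 3 and k = 1 this formula gives 12 replicas, which is consistent with the n = 12, f = 3 configuration used in the simulated network later in the deck, although the slides do not say how that n was chosen.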

slide-16
SLIDE 16

Our Mission

Find a way to make Spire more scalable, to allow for more replicas, and thus tolerate more intrusions

slide-17
SLIDE 17

3 Angles of Attack

  • Trusted Hardware - using a TPM
  • Taking Advantage of Known Network Characteristics
  • Hierarchy of Protocols

slide-18
SLIDE 18
  • SCADA & Spire Overview
  • High-Performance, Scalable Spire
  • Trusted Platform Module
  • Known Network Characteristics
  • Evaluating BFT-SMART
  • Benchmarking Results
  • Conclusions
slide-19
SLIDE 19

Trusted Platform Module

  • Specialized chip that holds a secret key and can perform cryptographic functions for the rest of the machine
  • The key never leaves the TPM

Too slow :’(

slide-20
SLIDE 20
  • SCADA & Spire Overview
  • High-Performance, Scalable Spire
  • Trusted Platform Module
  • Known Network Characteristics
  • Evaluating BFT-SMART
  • Benchmarking Results
  • Conclusions
slide-21
SLIDE 21

Leverage Network Characteristics

SCADA deployments are static and predictable. Most importantly, we know:

  • Geographically close - low-latency communication
  • Consistent number of clients and messaging pattern

slide-22
SLIDE 22

The Three BFT Protocol Families

  • PBFT
  • Spinning
  • Prime

slide-23
SLIDE 23

PBFT


slide-24
SLIDE 24

PBFT

  • When the leader fails we must perform a “view change”
  • This is by far the most expensive operation in PBFT

“[The view change] is the Achilles Heel” - Yair Amir
slide-25
SLIDE 25

Spinning

  • Every ordering is done by a different leader
  • A bad leader can delay exactly one ordering before it is evicted from the protocol
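The rotation idea on this slide can be illustrated with a toy leader schedule. This is a minimal sketch, assuming leaders are picked round-robin over replica ids and evicted replicas are simply skipped; the class and method names are hypothetical, and the actual Spinning protocol manages eviction in ways this sketch does not model.

```python
class RotatingLeader:
    """Toy model of per-ordering leader rotation with eviction.

    Assumption: the leader for each ordering is chosen round-robin over
    replica ids 0..n-1, skipping any replica that has been evicted.
    """

    def __init__(self, n):
        self.n = n
        self.evicted = set()  # ids of replicas no longer allowed to lead
        self.next_id = 0      # counter used to derive the next candidate

    def next_leader(self):
        """Return the leader for the next ordering and advance the rotation."""
        if len(self.evicted) >= self.n:
            raise RuntimeError("every replica has been evicted")
        while self.next_id % self.n in self.evicted:
            self.next_id += 1
        leader = self.next_id % self.n
        self.next_id += 1
        return leader

    def evict(self, replica):
        """Remove a replica from the rotation after it delays an ordering."""
        self.evicted.add(replica)


if __name__ == "__main__":
    rotation = RotatingLeader(4)
    print([rotation.next_leader() for _ in range(2)])  # replicas 0 and 1 lead
    rotation.evict(1)                                  # replica 1 delayed its ordering
    print([rotation.next_leader() for _ in range(4)])  # replica 1 is skipped from now on
```

Because the leader changes on every ordering, a single misbehaving replica affects only the ordering it leads, after which it drops out of the schedule.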

slide-26
SLIDE 26

Prime

  • Designed to remove load from the leader to allow for many clients without performance degradation
  • Performs one ordering every X milliseconds

slide-27
SLIDE 27

Prime

slide-28
SLIDE 28
  • SCADA & Spire Overview
  • High-Performance, Scalable Spire
  • Trusted Platform Module
  • Known Network Characteristics
  • Evaluating BFT-SMART
  • Benchmarking Results
  • Conclusions
slide-29
SLIDE 29

BFT-SMART

  • Implements “Yet Another Visit to Paxos” protocol (IBM Zurich) in Java

  • Modular, multi-threaded server replicas
  • Standard BFT message pattern
  • Modern protocol with ongoing development
slide-30
SLIDE 30

Multithreaded Design

[Diagram: Service Replica thread structure. Threads shown: Request Thread 1 …, Request Timer Thread, Leader Thread, Message Processor Thread, Receiver Threads 1 … n-1, Sender Threads 1 … n-1, Reply Thread. Labeled flows: Client Request, Server Reply, Server Consensus Communication.]

slide-31
SLIDE 31

BFT-SMART and Performance Attacks

  • Consensus relies on leader to order messages
  • A malicious leader could delay progress
  • Timeouts limit the leader’s worst-case performance

[Diagram: message flow among Client, Leader (primary), Replica 1, Replica 2, and Replica 3, showing a Malicious Delay at the Propose (Pre-Prepare) step.]
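How a timeout caps the damage from a delaying leader can be sketched with a toy model. It assumes each follower starts a timer when it learns of a request and asks for a view change if no proposal has arrived when the timer expires. The function names and the model itself are illustrative assumptions, not BFT-SMART's API.

```python
def max_undetected_delay(timeout_ms, normal_propose_ms):
    """Largest extra delay a leader can add without tripping the timeout.

    Assumes followers normally see the proposal normal_propose_ms after the
    request and suspect the leader once timeout_ms has elapsed.
    """
    return max(0.0, timeout_ms - normal_propose_ms)


def follower_decision(proposal_arrival_ms, timeout_ms):
    """What a follower does, given when the proposal arrived (toy model)."""
    return "accept" if proposal_arrival_ms < timeout_ms else "view-change"


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the slides.
    print(max_undetected_delay(timeout_ms=8.0, normal_propose_ms=5.0))
    print(follower_decision(proposal_arrival_ms=7.9, timeout_ms=8.0))
    print(follower_decision(proposal_arrival_ms=8.0, timeout_ms=8.0))
```

In this model the attacker's best strategy is to deliver the proposal just before the timeout, so a tighter timeout directly shrinks the worst-case delay.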

slide-32
SLIDE 32
  • SCADA & Spire Overview
  • High-Performance, Scalable Spire
  • Trusted Platform Module
  • Known Network Characteristics
  • Evaluating BFT-SMART
  • Benchmarking Results
  • Conclusions
slide-33
SLIDE 33

Simulating a SCADA Network

[Diagram: four sites (WAS, JHU, NYC, SVG) with 3 replicas per site, n = 12, f = 3. Link latencies shown between sites: 3ms, 3ms, 4ms, 2ms, 2ms, 4ms.]

slide-34
SLIDE 34

Normal-Case Latency

[Plot: Mean Latency vs. Number of Clients. Y-axis: Mean Latency (ms); x-axis: Number of Clients (10 to 100). Series: BFT-SMART, Prime.]

slide-35
SLIDE 35

Normal-Case Latency

  • Significantly lower with BFT-SMART, but increasing with number of clients
  • Matches expectations given fewer consensus rounds
  • Constant with Prime, due to batch ordering on a preset interval of 20ms
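One way to see why interval-based batching gives latency that does not depend on the number of clients is to look at how long a request waits for the next ordering tick. The sketch below is a toy model that assumes orderings start at exact multiples of the interval and ignores consensus rounds and network delay; the function names are hypothetical, and only the 20ms interval comes from the slide.

```python
import math


def wait_until_next_batch(arrival_ms, interval_ms=20.0):
    """Time a request waits for the next ordering tick.

    Assumes orderings start at multiples of interval_ms and that a request
    arriving exactly on a tick is included in that tick.
    """
    next_tick = math.ceil(arrival_ms / interval_ms) * interval_ms
    return next_tick - arrival_ms


def mean_wait(interval_ms=20.0, samples=1000):
    """Average wait over evenly spaced arrival times within one interval."""
    arrivals = [(i + 0.5) * interval_ms / samples for i in range(samples)]
    return sum(wait_until_next_batch(a, interval_ms) for a in arrivals) / samples


if __name__ == "__main__":
    print(wait_until_next_batch(5.0))  # waits for the tick at 20ms
    print(mean_wait())                 # about half the interval
```

The wait depends only on when a request arrives relative to the tick, not on how many other requests share the batch, which fits the flat Prime curve described above.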
slide-36
SLIDE 36

Performance Attack Latency

  • Tested 4 timeouts, chosen based on normal performance
  • 1. 8ms (aggressive)
  • 2. 10ms (conservative)
slide-37
SLIDE 37

Performance Attack Latency

  • Tested 4 timeouts, chosen based on normal performance
  • 1. 8ms (aggressive)
  • 2. 10ms (conservative)
  • 3. 16ms (aggressive, forwarding request at 8ms)
  • 4. 20ms (conservative, forwarding request at 10ms)
slide-38
SLIDE 38

Performance Attack Latency

  • Developed a malicious replica to delay sending pre-prepare messages as leader
  • Experimentally maximized delay up to each view change timeout
  • Measured worst-case latency seen by client under this condition

slide-39
SLIDE 39

Performance Attack Latency

[Plot: Measured Latency vs. Timeout. Y-axis: Mean Worst-Case Latency (ms); x-axis: Pre-Prepare Timeout (ms). Series: Worst-Case Latency, Normal Latency.]

slide-40
SLIDE 40

Performance Attack Latency

  • With a tight timeout, performance degradation is minimal
  • With a conservative timeout, performance degradation approaches 50% (26ms latency)
  • In either case, latency is lower than normal-case Prime and performance exceeds the requirement
  • This performance attack would not pose a risk to the SCADA system
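The figures on this slide allow a quick back-of-the-envelope check. The sketch below assumes that "degradation" means a relative increase over normal-case latency, which the slides do not state explicitly, and compares the quoted 26ms against the upper end of the 100-200ms SCADA requirement. The derived values are arithmetic on the quoted numbers, not measurements.

```python
REQUIREMENT_MS = 200.0    # upper end of the 100-200ms SCADA requirement
ATTACK_LATENCY_MS = 26.0  # latency quoted for the conservative timeout
DEGRADATION = 0.50        # "approaches 50%"


def implied_normal_latency(attack_ms, degradation):
    """Normal latency implied by attack_ms = normal * (1 + degradation)."""
    return attack_ms / (1.0 + degradation)


def headroom(latency_ms, requirement_ms):
    """Remaining budget before the latency requirement is violated."""
    return requirement_ms - latency_ms


if __name__ == "__main__":
    print(round(implied_normal_latency(ATTACK_LATENCY_MS, DEGRADATION), 1))
    print(headroom(ATTACK_LATENCY_MS, REQUIREMENT_MS))
```

Under that reading, even the conservative timeout leaves most of the latency budget unused, which is in line with the slide's conclusion that this attack does not threaten the system.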

slide-41
SLIDE 41

View Change

  • 50-70ms depending on number of pending requests
  • Slow due to unoptimized serialization, data structures, taking up to 40ms
  • Sequential view changes are an issue with multiple faulty replicas
  • With f ≥ 3, view change must be improved to meet the 200ms requirement
  • Prime view changes are on the order of 60-90ms
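The concern about sequential view changes follows from simple addition. The sketch below assumes that f faulty replicas become leader one after another, that each forces one view change, and that the costs add linearly with nothing else counted. These assumptions and the function names are illustrative and do not come from the slides.

```python
def worst_case_view_change_ms(f, per_view_change_ms):
    """Total time if f faulty replicas lead in a row, each forcing a view change.

    Assumes costs add linearly and ignores the time spent on ordering itself.
    """
    return f * per_view_change_ms


def meets_requirement(f, per_view_change_ms, requirement_ms=200.0):
    """True if the summed view-change time fits within the latency requirement."""
    return worst_case_view_change_ms(f, per_view_change_ms) <= requirement_ms


if __name__ == "__main__":
    # 70ms is the upper end of the 50-70ms range quoted above.
    for f in (1, 2, 3):
        print(f, worst_case_view_change_ms(f, 70.0), meets_requirement(f, 70.0))
```

At the 70ms upper end, two back-to-back view changes fit within 200ms but three do not, which agrees with the f ≥ 3 threshold on the slide.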
slide-42
SLIDE 42

Scalability Overhead

[Plot: LAN Latency vs. Number of Replicas. Y-axis: Latency (µs); x-axis: Number of replicas (n).]

slide-43
SLIDE 43

Scalability Overhead

  • Shows the computational overhead of increasing n
  • Latency appears linear with n, and grows at a reasonable rate
  • Actual latency determined by location of added replicas
  • Another geographic site vs. more replicas per site

slide-44
SLIDE 44
  • SCADA & Spire Overview
  • High-Performance, Scalable Spire
  • Trusted Platform Module
  • Known Network Characteristics
  • Evaluating BFT-SMART
  • Benchmarking Results
  • Conclusions
slide-45
SLIDE 45

BFT-SMART: Pros & Cons

PROS

  • Lightweight protocol & implementation
  • Possible to apply aggressive timeout
  • Low normal-case latency
  • Support for dynamic state transfer, reconfiguration/recovery

CONS

  • Latency increases with number of clients, concurrent requests
  • High view change cost
  • Java implementation
slide-46
SLIDE 46

Prime: Pros & Cons

PROS

  • Leader is not burdened by client requests
  • Bounded performance guarantee under attack
  • Latency remains constant as number of clients increases
  • Measurements performed so replicas can adapt to network conditions

CONS

  • 2 more consensus rounds per ordering
  • High view change cost
  • Significantly higher normal-case latency
slide-47
SLIDE 47

Conclusions

  • Strict limit on performance attacks possible with a lightweight protocol and bounded network latencies
  • View change still a high cost, but could be optimized
  • A viable path to scaling Spire
  • However, BFT-SMART introduces some new issues
slide-48
SLIDE 48

Conclusions: BFT-SMART

  • BFT-SMART is a good implementation, but not exactly what we need
  • Very good proof of concept that something with weaker guarantees than Prime could outperform Prime in this specific context using known network characteristics

slide-49
SLIDE 49

Conclusions: Prime

  • We want some of the features Prime has, specifically network measurements and batching.
  • We can live without Prime’s expensive offloading of the leader - we can assume that the computers can do the intended job fast enough (need to measure how long it takes for a full update compared with how long it takes for immediate response in Prime).

slide-50
SLIDE 50

Next Steps

  • Consider diversity and client-server communication
  • Interface with the Spines and SCADA hardware
  • Or, apply this approach to something new?

slide-51
SLIDE 51

Thank You

  • To Yair, Tom, Amy and Trevor
  • To the class
  • To Alysson Bessani and the BFT-SMART group