SLIDE 1 Evaluating BFT Protocols for Spire
Henry Schuh & Sam Beckley
600.667 Advanced Distributed Systems & Networks
SLIDE 2
- SCADA & Spire Overview
- High-Performance, Scalable Spire
- Trusted Platform Module
- Known Network Characteristics
- Evaluating BFT-SMART
- Benchmarking Results
- Conclusions
SLIDE 3
Power Grid Overview
SLIDE 4
SCADA Overview
SLIDE 5 SCADA Requirements
- Must have very low latencies
(100-200ms)
- Must have very high reliability
- Must be able to run for decades
SLIDE 6 SCADA Adopting IP & Internet
- In the past SCADA used proprietary
protocols on air gapped systems
- Now moving to both IP & the Internet
to reduce costs
SLIDE 7 “These devices were not only internet facing, they did not have
security mechanisms to prevent unauthorized access”
- Trend Micro Incorporated, Who’s Really Attacking Your ICS Systems
SLIDE 8 Attacks on SCADA Systems
28 Days: 39 Attacks
All targeted specifically at SCADA systems
The first attack was within 18 hours of the honeypot going live
Source: Trend Micro Incorporated, Who’s Really Attacking Your ICS Systems
SLIDE 9 Distributed Replication
- Several machines that coordinate their
actions such that they appear to be a single unified machine to a client.
Pros: High Availability and Performance
Cons: Cost of Synchronization
SLIDE 10 Intrusion Tolerant Replication
Somewhat Formally: The ability to make progress in the presence of some number of malicious replicas with guaranteed
- correctness. Some protocols also guarantee a level
- f performance under attack.
Informally: If some of the replicas get hacked the system still works.
SLIDE 11
Defense Across Space & Time
Defense Across Time: Have to periodically regain control of a compromised machine to stop the attacker from eventually gaining control of the entire network. Defense Across Space: Every replica must present a unique attack surface so that one attack cannot be used to compromise every replica.
SLIDE 12
- SCADA & Spire Overview
- High-Performance, Scalable Spire
- Trusted Platform Module
- Known Network Characteristics
- Evaluating BFT-SMART
- Benchmarking Results
- Conclusions
SLIDE 13 Spire
Open Source SCADA system that provides both standard crypto defense mechanisms as well as an intrusion tolerant SCADA Master. Spire uses several different technologies
SLIDE 14 Spire
SCADA Master Prime pvbrowser HMI SCADA Master Prime SCADA Master Prime SCADA Master Prime RTU / PLC Proxy RTU
External Spines Network Internal Spines Network
RTU / PLC Proxy PLC
SLIDE 15
Scaling Spire
In order to tolerate more intrusions we need more replicas The more replicas, the higher the latency becomes We rely on having very low latency
SLIDE 16
Our Mission
Find a way to make Spire more scalable, to allow for more replicas, and thus more intrusions
SLIDE 17
3 Angles of Attack
Trusted Hardware - using a TPM Taking Advantage of Known Network Characteristics Hierarchy of Protocols
SLIDE 18
- SCADA & Spire Overview
- High-Performance, Scalable Spire
- Trusted Platform Module
- Known Network Characteristics
- Evaluating BFT-SMART
- Benchmarking Results
- Conclusions
SLIDE 19
Trusted Platform Module
Specialized chip that holds a secret key and can perform cryptographic functions for the rest of the machine The key never leaves the TPM
Too slow :’(
SLIDE 20
- SCADA & Spire Overview
- High-Performance, Scalable Spire
- Trusted Platform Module
- Known Network Characteristics
- Evaluating BFT-SMART
- Benchmarking Results
- Conclusions
SLIDE 21 Leverage Network Characteristics
SCADA deployments are static and predictable Most importantly, we know:
- Geographically close - low latency
communication
- Consistent number of clients and messaging
pattern
SLIDE 22
The Three BFT Protocol Families
PBFT Spinning Prime
SLIDE 23
PBFT
PBFT Spinning Prime
SLIDE 24 PBFT
When the leader fails we must perform a “view change” This is by far the most expensive operation in PBFT
“[The view change] is the Achilles Heel”
SLIDE 25
Spinning
Every ordering is done by a different leader A bad leader can delay exactly one ordering before it is evicted from the protocol
SLIDE 26
Prime
Designed to remove load from the leader to allow for many clients without performance degradation Performs one ordering every X milliseconds
SLIDE 27
Prime
SLIDE 28
- SCADA & Spire Overview
- High-Performance, Scalable Spire
- Trusted Platform Module
- Known Network Characteristics
- Evaluating BFT-SMART
- Benchmarking Results
- Conclusions
SLIDE 29 BFT-SMART
- Implements “Yet Another Visit to Paxos” protocol
(IBM Zurich) in Java
- Modular, multi-threaded server replicas
- Standard BFT message pattern
- Modern protocol with ongoing development
SLIDE 30 Multithreaded Design
Request Thread 1 Leader Thread Request Timer Thread Message Processor Thread Receiver Thread 1 Receiver Thread n-1 Reply Thread Service Replica
Client Request Server Reply
Sender Thread 1 Sender Thread n-1
Server Consensus Communication
… …
SLIDE 31 (primary) 1 2 3 Pre-Prepare Malicious Delay
BFT-SMART and Performance Attacks
- Consensus relies on leader to order messages
- A malicious leader could delay progress
- Timeouts limit the leader’s worst-cast performance
Leader Client Replica 1 Replica 2 Replica 3 Propose (Pre-Prepare)
Malicious Delay
SLIDE 32
- SCADA & Spire Overview
- High-Performance, Scalable Spire
- Trusted Platform Module
- Known Network Characteristics
- Evaluating BFT-SMART
- Benchmarking Results
- Conclusions
SLIDE 33 Simulating a SCADA Network
WAS JHU NYC
SVG
3 replicas per site n = 12 f = 3
3ms 3ms 4ms 2ms 2ms 4ms
SLIDE 34 Normal-Case Latency
5 10 15 20 25 30 35 40 45 10 20 30 40 50 60 70 80 90 100 Mean Latency (ms) Number of Clients
Me Mean Latency vs. Number of Clients
BFT-SMART Prime
SLIDE 35 Normal-Case Latency
- Significantly lower with BFT-SMART, but
increasing with number of clients
- Matches expectations given fewer
consensus rounds
- Constant with Prime, due to batch
- rdering on a preset interval of 20ms
SLIDE 36 Performance Attack Latency
- Tested 4 timeouts, chosen based on normal performance
- 1. 8ms (aggressive)
- 2. 10ms (conservative)
SLIDE 37 Performance Attack Latency
- Tested 4 timeouts, chosen based on normal performance
- 1. 8ms (aggressive)
- 2. 10ms (conservative)
- 3. 16ms (aggressive, forwarding request at 8ms)
- 4. 20ms (conservative, forwarding request at 10ms)
SLIDE 38 Performance Attack Latency
- Developed a malicious replica to delay
sending pre-prepare messages as leader
- Experimentally maximized delay up to
each view change timeout
- Measured worst-case latency seen by
client under this condition
SLIDE 39 Performance Attack Latency
5 10 15 20 25 30 35 5 7 9 11 13 15 17 19 21 23 Mean Worst-Case Latency (ms) Pre-Prepare Timeout (ms)
Me Measured Latency vs. Timeout
Worst-Case Latency Normal Latency
SLIDE 40 Performance Attack Latency
- With a tight timeout, performance degradation is
minimal
- With a conservative timeout, performance degradation
approaches 50% (26ms latency)
- In either case, lower than normal-case Prime and exceeds
the required performance
- This performance attack would not pose a risk to the
SCADA system
SLIDE 41 View Change
- 50-70ms depending on number of pending requests
- Slow due to unoptimized serialization, data structures, taking up
to 40ms
- Sequential view changes are an issue with multiple faulty replicas
- With f ≥ 3, view change must be improved to meet the
200ms requirement
- Prime view changes are on the order of 60-90ms
SLIDE 42 Scalability Overhead
100 200 300 400 500 600 5 10 15 20 25 La Latency (µs) Nu Number of replicas (n)
LA LAN La Latency vs. Number of Replicas
SLIDE 43 Scalability Overhead
- Shows the computational overhead of increasing n
- Latency appears linear with n, and grows at a
reasonable rate
- Actual latency determined by location of added
replicas
- Another geographic site vs. more replicas
per site
SLIDE 44
- SCADA & Spire Overview
- High-Performance, Scalable Spire
- Trusted Platform Module
- Known Network Characteristics
- Evaluating BFT-SMART
- Benchmarking Results
- Conclusions
SLIDE 45 BFT-SMART: Pros & Cons
PROS
- Lightweight protocol & implementation
- Possible to apply aggressive timeout
- Low normal-case latency
- Support for dynamic state transfer, reconfiguration/recovery
CONS
- Latency increases with number of clients, concurrent requests
- High view change cost
- Java implementation
SLIDE 46 Prime: Pros & Cons
PROS
- Leader is not burdened by client requests
- Bounded performance guarantee under attack
- Latency remains constant as number of clients increases
- Measurements performed so replicas can adapt to network conditions
CONS
- 2 more consensus rounds per ordering
- High view change cost
- Significantly higher normal-case latency
SLIDE 47 Conclusions
- Strict limit on performance attacks possible with a
lightweight protocol and bounded network latencies
- View change still a high cost, but could be
- ptimized
- A viable path to scaling Spire
- However, BFT-SMART introduces some new issues
SLIDE 48 Conclusions: BFT-SMART
- BFT-SMART is a good implementation, but
not exactly what we need
- Very good proof of concept that something
with weaker guarantees than Prime could
- utperform Prime in this specific context
using known network characteristics
SLIDE 49 Conclusions: Prime
- We want some of the features Prime has,
specifically, network measurements and batching.
- We can live without Prime’s expensive offloading
- f the leader - we can assume that the computers
can do the intended job fast enough (need to measure how long it takes for a full update compared with how long it takes for immediate response in Prime).
SLIDE 50 Next Steps
- Consider diversity and client-server
communication
- Interface with the Spines and SCADA
hardware
- Or, apply this approach to something
new?
SLIDE 51 Thank You
- To Yair, Tom, Amy and Trevor
- To the class
- To Alysson Bessani and the
BFT-SMART group