UpRight Cluster Services Allen Clement, Manos Kapritsos, Sangmin - - PowerPoint PPT Presentation



slide-1
SLIDE 1

UpRight Cluster Services

Allen Clement, Manos Kapritsos, Sangmin Lee, Yang Wang, Lorenzo Alvisi, Mike Dahlin, Taylor Riche The University of Texas at Austin

slide-2
SLIDE 2

Failures are not fail-stop

slide-3
SLIDE 3

To the rescue

Byzantine Fault Tolerance (BFT)

  • tolerate arbitrary failures
  • safe always
  • good performance with failures
  • if network behaves well, eventually live

slide-4
SLIDE 4

This talk

BFT in real systems

  • ZooKeeper, Hadoop Distributed File System

What does it take? Revising much of what we think we know

  • Failure model
  • BFT implementation
  • API

slide-6
SLIDE 6

For better or for worse

Byzantine model is most general

  • all you need are replicas... 3f + 1

slide-7
SLIDE 7

Up Right

slide-8
SLIDE 8

Up

maximum number of failures under which liveness* is ensured

u=

slide-9
SLIDE 9

Right

maximum number of malicious failures under which safety is preserved

r=

slide-10
SLIDE 10

Up Right

u = maximum number of failures under which liveness* is ensured

r = maximum number of malicious failures under which safety is preserved

  • (Lamport 2003; Dutta et al 2005; El-Malek et al 2005)

slide-11
SLIDE 11

Up Right

u = maximum number of failures under which liveness* is ensured

r = maximum number of malicious failures under which safety is preserved

  • (Lamport 2003; Dutta et al 2005; El-Malek et al 2005)

agreement: 2u+r+1 replicas

slide-12
SLIDE 12

Up Right

u = maximum number of failures under which liveness* is ensured

r = maximum number of malicious failures under which safety is preserved

  • (Lamport 2003; Dutta et al 2005; El-Malek et al 2005)

Replicas required for agreement: 2u+r+1

        u=0   u=1   u=2   u=3
r=0      1     3     5     7
r=1      2     4     6     8
r=2      3     5     7     9
r=3      4     6     8    10

Table labels: Crash tolerant, BFT, “Pay per B” FT
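
The table above follows directly from the 2u+r+1 bound. As a minimal sketch (the function names are illustrative and not part of UpRight), the following Python regenerates it:

```python
def agreement_replicas(u: int, r: int) -> int:
    """Replicas required for agreement, using the slide's 2u + r + 1 bound."""
    if u < 0 or r < 0:
        raise ValueError("u and r must be non-negative")
    return 2 * u + r + 1


def replica_table(max_u: int = 3, max_r: int = 3) -> list[list[int]]:
    """Rows are indexed by r and columns by u, matching the slide's layout."""
    return [
        [agreement_replicas(u, r) for u in range(max_u + 1)]
        for r in range(max_r + 1)
    ]


if __name__ == "__main__":
    for r, row in enumerate(replica_table()):
        print(f"r={r}", *row)
```

Setting r = 0 and u = f gives the familiar 2f + 1 of crash-tolerant replication, and setting u = r = f gives the 3f + 1 of classic BFT.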

slide-13
SLIDE 13

One Library to Rule Them All

  • Only pay for the fault tolerance you need
  • One fault tolerant library

slide-14
SLIDE 14

Revising what we think we know

  • Failure model
  • BFT implementation
  • API

slide-15
SLIDE 15

Separating Order from Execution

[Diagram: Request → Order → Ordered request → Execution]

Separating agreement from execution for Byzantine fault tolerant services [SOSP 2003]

slide-16
SLIDE 16

Big MAC Attack

Making BFT systems tolerate Byzantine failures [NSDI 2009]

slide-18
SLIDE 18

Big MAC Attack

  • Faulty Client
  • Faulty Primary

slide-19
SLIDE 19

A More Perfect Separation

[Diagram: Request → Authentication → Valid request → Order → Ordered request → Execution]

slide-20
SLIDE 20

A More Perfect Union

Execution Authentication Order

slide-21
SLIDE 21

Speculating

Command

Voter

Execution

Order

Zyzzyva (Kotla et al 2007)

slide-22
SLIDE 22

Misunderspeculation

Speculation is a good idea

Speculative execution is a bad idea

  • if wrong, lots of work
  • it’s not about execution, anyway

slide-23
SLIDE 23

Misunderspeculation

Speculation is a good idea

Speculative execution is a bad idea

  • if wrong, lots of work
  • it’s not about execution, anyway

Command

UpRight

  • speculative ordering
  • execution nodes never roll back

Authentication Execution

slide-24
SLIDE 24

Revisiting conventional wisdom

  • Failure model
  • BFT implementation
  • API

slide-25
SLIDE 25

API

Old World Order

App

  • execute

Library

  • result

UpRight World Order

App

  • execute
  • loadCP
  • takeCP

Library

  • result
  • returnCP
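
To make the difference between the two interfaces concrete, here is a hypothetical sketch in Python. Only the call names (execute, takeCP, loadCP, result, returnCP) come from the slide. The signatures, the `CounterApp` class, and the use of JSON for checkpoints are assumptions for illustration, not the UpRight library's actual API.

```python
import json
from typing import Callable


class CounterApp:
    """Toy application exposing the three calls the slide lists under App."""

    def __init__(self,
                 result: Callable[[int, str], None],
                 return_cp: Callable[[int, bytes], None]) -> None:
        # `result` and `return_cp` stand in for the Library-side calls.
        self._result = result
        self._return_cp = return_cp
        self._counts: dict[str, int] = {}

    def execute(self, seq_no: int, request: str) -> None:
        """Apply one ordered request and hand the reply to the library."""
        self._counts[request] = self._counts.get(request, 0) + 1
        self._result(seq_no, f"{request}={self._counts[request]}")

    def take_cp(self, seq_no: int) -> None:
        """Serialize the complete state and return it as a checkpoint."""
        blob = json.dumps(self._counts, sort_keys=True).encode("utf-8")
        self._return_cp(seq_no, blob)

    def load_cp(self, blob: bytes) -> None:
        """Replace the current state with a previously taken checkpoint."""
        self._counts = json.loads(blob.decode("utf-8"))


if __name__ == "__main__":
    replies: list[tuple[int, str]] = []
    checkpoints: dict[int, bytes] = {}
    app = CounterApp(lambda s, r: replies.append((s, r)),
                     checkpoints.__setitem__)
    app.execute(1, "put")
    app.take_cp(1)
    print(replies, checkpoints)
```

In the old interface the application would implement only `execute` and the library would receive only `result`. The extra checkpoint calls are what the UpRight interface adds on the application side.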
slide-26
SLIDE 26

Case Study: Make HDFS UpRight

Users

NameNode

  • Map files to blocks
  • Map blocks to data nodes

Data Nodes

  • Store blocks

slide-27
SLIDE 27

What was required?

  • Make execution deterministic: ~150 lines of code
  • Make checkpoints deterministic and complete: ~1500 lines of code

That’s it.
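
The two requirements above can be illustrated with a small Python sketch. It assumes that each ordered request arrives with an agreed-upon timestamp (so replicas never consult their local clocks) and that checkpoints are serialized canonically. The `FileTable` class and its methods are hypothetical stand-ins, not HDFS or UpRight code.

```python
import hashlib
import json


class FileTable:
    """Toy stand-in for NameNode-style metadata: path -> modification time."""

    def __init__(self) -> None:
        self._mtime: dict[str, int] = {}

    def execute(self, request: dict, agreed_time_ms: int) -> None:
        # Deterministic execution: use the timestamp supplied with the
        # ordered request instead of the local clock, so that every
        # replica records the same value.
        self._mtime[request["path"]] = agreed_time_ms

    def checkpoint(self) -> bytes:
        # Deterministic checkpoint: sort_keys yields the same bytes no
        # matter in which order entries were inserted into the dict.
        return json.dumps(self._mtime, sort_keys=True).encode("utf-8")

    def load(self, blob: bytes) -> None:
        # Complete checkpoint: the blob alone is enough to rebuild the state.
        self._mtime = json.loads(blob.decode("utf-8"))


def digest(blob: bytes) -> str:
    """Hash that replicas could compare to confirm identical checkpoints."""
    return hashlib.sha256(blob).hexdigest()
```

A replica that executed every request and a replica that was restored from a checkpoint hold their entries in different insertion orders, yet both produce the same checkpoint bytes and therefore the same digest.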

slide-28
SLIDE 28

Do DataNodes Need the UpRight Treatment?

Primary

Users NameNode Data Nodes

UpRight UpRight UpRight UpRight

slide-29
SLIDE 29

Primary

Users NameNode Data Nodes

UpRight

block hash

Modified DataNode

slide-30
SLIDE 30

Primary

Users NameNode Data Nodes

UpRight

hash hash hash hash

Modified DataNode

slide-31
SLIDE 31

Modified DataNode

Primary

Users NameNode Data Nodes

UpRight

hash block
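
One reading of slides 29 to 31 is that DataNodes are not replicated with UpRight. Instead, a hash of each block is kept with the UpRight NameNode and checked when the block is read back. The Python sketch below works under that reading. The class names, the choice of SHA-256, and the decision to verify inside a single `read_block` helper are all assumptions, not the modified DataNode protocol itself.

```python
import hashlib


class NameNode:
    """Stand-in for the UpRight-replicated NameNode: holds trusted hashes."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}

    def record_hash(self, block_id: str, block_hash: str) -> None:
        self._hashes[block_id] = block_hash

    def expected_hash(self, block_id: str) -> str:
        return self._hashes[block_id]


class DataNode:
    """Stand-in for an unreplicated DataNode that may return corrupt data."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def write(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data

    def read(self, block_id: str) -> bytes:
        return self.blocks[block_id]


def write_block(name_node: NameNode, data_nodes: list[DataNode],
                block_id: str, data: bytes) -> None:
    """Store the block on every DataNode and its hash on the NameNode."""
    for node in data_nodes:
        node.write(block_id, data)
    name_node.record_hash(block_id, hashlib.sha256(data).hexdigest())


def read_block(name_node: NameNode, data_nodes: list[DataNode],
               block_id: str) -> bytes:
    """Return the first copy whose hash matches the NameNode's record."""
    expected = name_node.expected_hash(block_id)
    for node in data_nodes:
        data = node.read(block_id)
        if hashlib.sha256(data).hexdigest() == expected:
            return data
    raise RuntimeError("no DataNode returned a block matching the stored hash")
```

Under this sketch a corrupted copy on one DataNode is detected and skipped, as long as at least one correct copy remains.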

slide-32
SLIDE 32

HDFS LOC Changes

NameNode Execution     ~150
NameNode Checkpoints   ~1500
DataNode Protocol      ~900

HDFS: ~37k LOC total

slide-33
SLIDE 33

HDFS Evaluation

  • Amazon EC2 small instances
  • 50 clients; each client writes/reads a 1 GB file
  • 50 data nodes

HDFS configuration     Authentication / Order / NameNodes   DataNode replication factor
Original HDFS          - / - / 1                            3
CFT HDFS (u=1,r=0)     3 / 3 / 3                            3
BFT HDFS (u=1,r=1)     4 / 4 / 3                            3

slide-34
SLIDE 34

HDFS Throughput

[Bar chart: write and read throughput (MB/s) for HDFS, CFT HDFS, and BFT HDFS]

slide-35
SLIDE 35

HDFS Computational Costs

[Bar chart: computational cost (Mcycles/GB) of writes and reads for HDFS, CFT_HDFS, and BFT_HDFS, broken down into UpRight Core, Data Node, and Name Node]

slide-36
SLIDE 36

This talk

BFT in real systems

  • ZooKeeper, Hadoop Distributed File System

What it took

  • UpRight
  • BFT Implementation
  • API

slide-37
SLIDE 37

What the future holds

The plural of “anecdote” is not “data”

Quantify the risks

  • how frequently do Byzantine failures occur?
  • how much damage can they do?

Quantify the benefits

  • what fraction of these failures does BFT mask?

slide-38
SLIDE 38

Matrix signatures

(Aiyer et al 2008)

Order

Separate order from authentication

slide-39
SLIDE 39

Matrix signatures

(Aiyer et al 2008)

slide-40
SLIDE 40

Matrix signatures

(Aiyer et al 2008)

Primary orders request if sufficiently many valid MACs

slide-41
SLIDE 41

Matrix signatures

(Aiyer et al 2008)

Primary orders request if sufficiently many valid MACs

  • Validity: request is from client (n ≥ r + 1)

slide-42
SLIDE 42

Matrix signatures

(Aiyer et al 2008)

Primary orders request if sufficiently many valid MACs

  • Validity: request is from client (n ≥ r + 1)
  • Transitive validity: convince others (n ≥ 2r + 1)

slide-43
SLIDE 43

Matrix signatures

(Aiyer et al 2008)

Primary orders request if sufficiently many valid MACs

  • Validity: request is from client (n ≥ r + 1)
  • Transitive validity: convince others (n ≥ 2r + 1)
  • Liveness: request will go through (n ≥ 2r + u + 1)
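
The three thresholds can be checked mechanically. The Python sketch below simply encodes the inequalities from the slide. The function name is illustrative, and n is used exactly as it appears on the slide, which does not define it further.

```python
def matrix_signature_properties(n: int, u: int, r: int) -> dict[str, bool]:
    """Report which of the slide's three properties hold for a given n."""
    if min(n, u, r) < 0:
        raise ValueError("n, u and r must be non-negative")
    return {
        "validity": n >= r + 1,                  # request is from client
        "transitive_validity": n >= 2 * r + 1,   # can convince others
        "liveness": n >= 2 * r + u + 1,          # request will go through
    }


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, matrix_signature_properties(n, u=1, r=1))
```

With u = r = 1, for example, n = 3 satisfies validity and transitive validity but not liveness, while n = 4 satisfies all three.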