Prophecy: Using History for High‐Throughput Fault Tolerance



slide-1
SLIDE 1

Prophecy: Using History for High‐Throughput Fault Tolerance

Siddhartha Sen

Joint work with Wyatt Lloyd and Mike Freedman Princeton University

slide-2
SLIDE 2

Non‐crash failures happen

slide-3
SLIDE 3

Non‐crash failures happen

Model as Byzantine (malicious)

slide-4
SLIDE 4

Mask Byzantine faults

Clients Service

slide-5
SLIDE 5

Mask Byzantine faults

Throughput

Clients Replicated service

slide-6
SLIDE 6

Mask Byzantine faults

Throughput

Clients Replicated service

slide-7
SLIDE 7

Mask Byzantine faults

Throughput

Clients Replicated service

slide-8
SLIDE 8

Mask Byzantine faults

Throughput

Clients Replicated service

slide-9
SLIDE 9

Mask Byzantine faults

Throughput

Linearizability (strong consistency)

Clients Replicated service

slide-10
SLIDE 10

Byzantine fault tolerance (BFT)

  • Low throughput
  • Modifies clients
  • Long‐lived sessions
slide-11
SLIDE 11

Prophecy

  • High throughput + good consistency
  • No free lunch:

– Read‐mostly workloads
– Slightly weakened consistency

slide-12
SLIDE 12

Byzantine fault tolerance (BFT)

  • Low throughput

D‐Prophecy

  • Modifies clients

Prophecy

  • Long‐lived sessions
slide-13
SLIDE 13

Traditional BFT reads

application application

Clients Replica Group

slide-14
SLIDE 14

Traditional BFT reads

application application Agree?

Clients Replica Group

slide-15
SLIDE 15

A cache solution

application cache application

Clients Replica Group

slide-16
SLIDE 16

A cache solution

application cache application Agree?

Clients Replica Group

slide-17
SLIDE 17

A cache solution

application cache application

Agree?

Problems:

  • Huge cache
  • Invalidation

Clients Replica Group

slide-18
SLIDE 18

A compact cache

application cache application

Requests   Responses
req1       resp1
req2       resp2
req3       resp3

Clients Replica Group

slide-19
SLIDE 19

A compact cache

application cache application

Requests       Responses
sketch(req1)   sketch(resp1)
sketch(req2)   sketch(resp2)
sketch(req3)   sketch(resp3)

Clients Replica Group
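The compact cache above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how a sketch is computed, so the truncated SHA-256 digest, the 8-byte size, and the names `sketch` and `SketchCache` are all hypothetical.

```python
import hashlib


def sketch(data: bytes, size: int = 8) -> bytes:
    """Return a short fixed-size digest that stands in for the full message.

    Assumption: a truncated SHA-256 digest; the slides do not name a hash.
    """
    return hashlib.sha256(data).digest()[:size]


class SketchCache:
    """Stores sketch(request) -> sketch(response) instead of full messages."""

    def __init__(self) -> None:
        self._table = {}

    def update(self, request: bytes, response: bytes) -> None:
        """Record the sketch of the response for this request."""
        self._table[sketch(request)] = sketch(response)

    def matches(self, request: bytes, response: bytes) -> bool:
        """True only if an entry exists and agrees with this response."""
        return self._table.get(sketch(request)) == sketch(response)
```

Because each entry is two small digests, the table size depends on the number of distinct requests rather than on the size of the responses, which is the point of the "compact" cache.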

slide-20
SLIDE 20

A sketcher

application sketcher application

Clients Replica Group

slide-21
SLIDE 21

Executing a read

sketch webpage

Clients Replica Group

slide-22
SLIDE 22

Executing a read

sketch webpage

Clients Replica Group

slide-23
SLIDE 23

Executing a read

sketch webpage

Clients Replica Group

slide-24
SLIDE 24

Executing a read

sketch webpage

Agree?

Clients Replica Group

slide-25
SLIDE 25

Executing a read

sketch webpage

Agree?

Fast, load‐balanced reads

Clients Replica Group

slide-26
SLIDE 26

Executing a read

sketch webpage

Agree?

Clients Replica Group

slide-27
SLIDE 27

Executing a read

sketch webpage

Clients Replica Group

slide-28
SLIDE 28

Executing a read

sketch webpage

key‐value store

replicated state machine

Clients Replica Group

slide-29
SLIDE 29

Executing a read

sketch webpage

Clients Replica Group

slide-30
SLIDE 30

Executing a read

sketch webpage

Clients Replica Group

slide-31
SLIDE 31

Executing a read

sketch webpage

Clients Replica Group

slide-32
SLIDE 32

Executing a read

sketch webpage

Agree?

Clients Replica Group

slide-33
SLIDE 33

Executing a read

sketch webpage

Agree?

Maintain a fresh cache

Clients Replica Group

slide-34
SLIDE 34

Did we achieve linearizability? NO!

slide-35
SLIDE 35

Executing a read

sketch webpage

Clients Replica Group

slide-36
SLIDE 36

Executing a read

sketch webpage

Clients Replica Group

slide-37
SLIDE 37

Executing a read

sketch webpage

Agree?

Clients Replica Group

slide-38
SLIDE 38

Executing a read

sketch webpage

Clients Replica Group

slide-39
SLIDE 39

Executing a read

sketch webpage

Agree?

Clients Replica Group

slide-40
SLIDE 40

Executing a read

sketch webpage

Agree?

Fast reads may be stale

Clients Replica Group

slide-41
SLIDE 41

Load balancing

sketch webpage

Clients Replica Group

slide-42
SLIDE 42

Load balancing

sketch webpage

Agree?

Clients Replica Group

slide-43
SLIDE 43

Load balancing

sketch webpage

Agree?

Pr(k stale) = g^k

Clients Replica Group
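The stale-read bound on this slide can be worked through numerically. The slide does not define g, so the sketch below assumes g is the probability that a single randomly chosen replica returns a stale response, and that successive choices are independent. Under those assumptions, k consecutive stale reads occur with probability g multiplied by itself k times. The function name is hypothetical.

```python
def prob_k_consecutive_stale(g: float, k: int) -> float:
    """Probability of k stale reads in a row, assuming each read goes to an
    independently chosen replica that is stale with probability g."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("g must be a probability in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return g ** k
```

For example, with g = 0.25, three stale reads in a row have probability 0.25 × 0.25 × 0.25 = 0.015625, so load balancing across replicas makes long runs of stale responses unlikely.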

slide-44
SLIDE 44

D‐Prophecy vs. BFT

Traditional BFT:

  • Each replica executes read
  • Linearizability

D‐Prophecy:

  • One replica executes read
  • “Delay‐once” linearizability

Clients Replica Group

slide-45
SLIDE 45

Byzantine fault tolerance (BFT)

  • Low throughput

D‐Prophecy

  • Modifies clients

Prophecy

  • Long‐lived sessions
slide-46
SLIDE 46

Key‐exchange overhead

slide-47
SLIDE 47

Key‐exchange overhead

11%

slide-48
SLIDE 48

Key‐exchange overhead

11% 3%

slide-49
SLIDE 49

Internet services

Clients Replica Group

slide-50
SLIDE 50

A proxy solution

Sketcher

Proxy

Clients Replica Group

slide-51
SLIDE 51

A proxy solution

Consolidate sketchers

Sketcher

Proxy

Clients Replica Group

slide-52
SLIDE 52

A proxy solution

Consolidate sketchers

Sketcher

Clients Replica Group

slide-53
SLIDE 53

A proxy solution

Sketcher must be fail‐stop

Sketcher

Trusted

Clients Replica Group

slide-54
SLIDE 54

A proxy solution

Sketcher must be fail‐stop

  • Trust middlebox already
  • Small and simple

Sketcher

Trusted

Clients Replica Group

slide-55
SLIDE 55

Executing a read

q

Sketcher

Trusted

Clients Replica Group

slide-56
SLIDE 56

Executing a read

Sketcher

Trusted

Clients Replica Group

slide-57
SLIDE 57

Executing a read

Sketcher

Trusted

Clients Replica Group

slide-58
SLIDE 58

Executing a read

Sketcher

Trusted

Req Resp
s(q) ⋅⋅⋅ ⋅⋅⋅

Clients Replica Group

slide-59
SLIDE 59

Executing a read

Sketcher

Trusted

Clients Replica Group

slide-60
SLIDE 60

Executing a read

Sketcher

Trusted

Clients Replica Group

slide-61
SLIDE 61

Executing a read

Sketcher

Trusted

Clients Replica Group

slide-62
SLIDE 62

Executing a read

Sketcher

Trusted

Req Resp
s(q) ⋅⋅⋅ ⋅⋅⋅

Clients Replica Group

slide-63
SLIDE 63

Executing a read

Sketcher

Trusted

Req Resp
s(q) ⋅⋅⋅ ⋅⋅⋅

Clients Replica Group

slide-64
SLIDE 64

Prophecy

Sketcher

Trusted

Clients Replica Group

slide-65
SLIDE 65

Prophecy

Fast, load‐balanced reads

Sketcher

Trusted

Clients Replica Group

slide-66
SLIDE 66

Prophecy

Fast reads may be stale

Sketcher

Trusted

Req Resp
s(q) ⋅⋅⋅ ⋅⋅⋅

Clients Replica Group
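The proxy read path pictured in these slides might look roughly like the following. This is a minimal sketch, not the authors' implementation: the class and method names are hypothetical, replicas are modeled as plain callables, the slow path is a stand-in `replicated_read` callable for a full agreement round through the replica group, and the sketch function is assumed to be a truncated SHA-256 digest.

```python
import hashlib
import random


def sketch(data: bytes) -> bytes:
    """Short digest of a message (assumption: truncated SHA-256)."""
    return hashlib.sha256(data).digest()[:8]


class Sketcher:
    """Trusted proxy holding a table of sketch(request) -> sketch(response)."""

    def __init__(self, replicas, replicated_read, rng=None):
        self.replicas = replicas                # each: request -> response
        self.replicated_read = replicated_read  # slow path through the group
        self.table = {}
        self.rng = rng or random.Random()
        self.fast_reads = 0
        self.slow_reads = 0

    def read(self, request: bytes) -> bytes:
        key = sketch(request)
        if key in self.table:
            # Fast path: ask one randomly chosen replica (load balancing)
            # and accept its response only if it matches the stored sketch.
            response = self.rng.choice(self.replicas)(request)
            if sketch(response) == self.table[key]:
                self.fast_reads += 1
                return response
        # No entry, or the response disagreed with history: fall back to
        # the replicated read and refresh the table entry.
        response = self.replicated_read(request)
        self.table[key] = sketch(response)
        self.slow_reads += 1
        return response
```

In this sketch nothing invalidates a table entry when the underlying state changes, so a replica that has not yet applied a recent write can still match the stored sketch. That is one way to see why the slide warns that fast reads may be stale.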

slide-67
SLIDE 67

Delay‐once linearizability

slide-68
SLIDE 68

Delay‐once linearizability

slide-69
SLIDE 69

Delay‐once linearizability

〈 W, R, W, W, R, R, W, R 〉

slide-70
SLIDE 70

Delay‐once linearizability

Read‐after‐write property

〈 W, R, W, W, R, R, W, R 〉

slide-71
SLIDE 71

Delay‐once linearizability

Read‐after‐write property

〈 W, R, W, W, R, R, W, R 〉

slide-72
SLIDE 72

Example application

  • Upload embarrassing photos
  • 1. Remove colleagues from ACL
  • 2. Upload photos
  • 3. (Refresh)
  • Weak consistency may reorder
  • Delay‐once preserves order
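A toy model shows why the order of the two writes matters. Everything here is hypothetical and chosen for illustration (the operation names, the initial ACL contents, and the function itself): the code replays a sequence of operations and reports whether a colleague could ever have seen an uploaded photo.

```python
def colleague_exposed(ops) -> bool:
    """Replay writes in the given order and report whether, at any point,
    photos were uploaded while the colleague was still on the ACL."""
    acl = {"friend", "colleague"}  # assumed initial access list
    photos_uploaded = False
    exposed = False
    for op in ops:
        if op == "remove_colleague_from_acl":
            acl.discard("colleague")
        elif op == "upload_photos":
            photos_uploaded = True
        else:
            raise ValueError(f"unknown operation: {op}")
        if photos_uploaded and "colleague" in acl:
            exposed = True
    return exposed


issued_order = ["remove_colleague_from_acl", "upload_photos"]
reordered = ["upload_photos", "remove_colleague_from_acl"]
```

With `issued_order` the colleague is never exposed to the photos. With `reordered`, the kind of reordering a weakly consistent store may allow, there is a window in which the photos are visible to them.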
slide-73
SLIDE 73

Byzantine fault tolerance (BFT)

  • Low throughput

D‐Prophecy

  • Modifies clients

Prophecy

  • Long‐lived sessions
slide-74
SLIDE 74

Implementation

  • Modified PBFT

– PBFT is stable, complete
– Competitive with Zyzzyva et al.

  • C++, Tamer async I/O

– Sketcher: ∼2000 LOC
– PBFT library: ∼1140 LOC
– PBFT client: ∼1000 LOC

slide-75
SLIDE 75

Evaluation

  • Prophecy vs. proxied‐PBFT

– Proxied systems

  • D‐Prophecy vs. PBFT

– Non‐proxied systems

slide-76
SLIDE 76

Evaluation

  • Prophecy vs. proxied‐PBFT

– Proxied systems

  • We will study:

– Performance on “null” workloads
– Performance with real replicated service
– Where system bottlenecks, how to scale

slide-77
SLIDE 77

Basic setup

Sketcher

(concurrent)

Clients (100) Replica Group (PBFT)

slide-78
SLIDE 78
slide-79
SLIDE 79

Fraction of failed fast reads

slide-80
SLIDE 80

Fraction of failed fast reads

Alexa top sites: < 15%

slide-81
SLIDE 81

Small benefit on null reads

slide-82
SLIDE 82

Small benefit on null reads

slide-83
SLIDE 83

Apache webserver setup

Sketcher

Clients Replica Group

slide-84
SLIDE 84

Large benefit on real workload

slide-85
SLIDE 85

Large benefit on real workload

3.7x

slide-86
SLIDE 86

Large benefit on real workload

3.7x 2.0x

slide-87
SLIDE 87

Large benefit on real workload

3.7x 2.0x

slide-88
SLIDE 88

Benefit grows with work

slide-89
SLIDE 89

Benefit grows with work

slide-90
SLIDE 90

Benefit grows with work

slide-91
SLIDE 91

Benefit grows with work

94μs (Apache)

slide-92
SLIDE 92

Benefit grows with work

94μs (Apache)

Null workloads are misleading!

slide-93
SLIDE 93

Benefit grows with work

slide-94
SLIDE 94

Single sketcher bottlenecks

slide-95
SLIDE 95

Single sketcher bottlenecks

slide-96
SLIDE 96

Scaling out

slide-97
SLIDE 97

Scales linearly with replicas

slide-98
SLIDE 98

Summary

  • Prophecy good for Internet services

– Fast, load‐balanced reads

  • D‐Prophecy good for traditional services
  • Prophecy scales linearly while PBFT stays flat
  • Limitations:

– Read‐mostly workloads (measurement study corroborates)
– Delay‐once linearizability (useful for many apps)

slide-99
SLIDE 99

Thank You

slide-100
SLIDE 100

Additional slides

slide-101
SLIDE 101

Transitions

  • Prophecy good for read‐mostly workloads
  • Are transitions rare in practice?
slide-102
SLIDE 102

Measurement study

  • Alexa top sites
  • Access main page every 20 sec for 24 hrs
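The sampling schedule above implies a fixed number of fetches per site, and counting transitions over those fetches is straightforward. The sketch below is an assumption-laden illustration: it treats a "transition" as two consecutive fetches of the same page whose contents differ, compares contents by SHA-256 digest, and uses hypothetical function names. The slides do not describe the study's actual tooling.

```python
import hashlib

SAMPLE_INTERVAL_S = 20          # one fetch every 20 seconds
DURATION_S = 24 * 60 * 60       # for 24 hours


def expected_samples() -> int:
    """Number of fetches per site implied by the schedule."""
    return DURATION_S // SAMPLE_INTERVAL_S


def count_transitions(pages) -> int:
    """Count consecutive fetches whose contents differ."""
    digests = [hashlib.sha256(p).digest() for p in pages]
    return sum(1 for a, b in zip(digests, digests[1:]) if a != b)


def transition_fraction(pages) -> float:
    """Fraction of consecutive pairs that are transitions."""
    if len(pages) < 2:
        return 0.0
    return count_transitions(pages) / (len(pages) - 1)
```

The schedule works out to 86,400 / 20 = 4,320 fetches per site. A low transition fraction over those fetches would indicate a read-mostly workload.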
slide-103
SLIDE 103

Mostly static content

slide-104
SLIDE 104

Mostly static content

slide-105
SLIDE 105

Mostly static content

15%

slide-106
SLIDE 106

Dynamic content

  • Rabin fingerprinting on transitions
  • 43% differ by single contiguous change
  • Sampled 4000 of them, over half due to:

– Load balancing directives
– Random IDs in links, function parameters