SLIDE 1

The OceanStore Write Path

Sean C. Rhea John Kubiatowicz University of California, Berkeley June 11, 2002

SLIDE 2

Introduction: the OceanStore Write Path

SLIDE 8

Introduction: the OceanStore Write Path

  • The Inner Ring
    – Acts as the single point of consistency for a file
    – Performs write access control, serialization
    – Creates archival fragments of new data and disperses them
    – Certifies the results of its actions with cryptography

  • The Second Tier
    – Caches certificates and data produced at the inner ring
    – Self-organizes into a dissemination tree to share results

  • The Archival Storage Servers
    – Store archival fragments generated in the Inner Ring

  • The Client Machines
    – Create updates and send them to the inner ring
    – Wait for responses to come down the dissemination tree

SLIDE 11

Introduction: the OceanStore Write Path (con’t)

[Figure: write timeline — the application's request reaches the inner ring after T_req, agreement takes T_agree, and the result is then disseminated to the replicas and the archive]

  • 1. A client sends an update to the inner ring
  • 2. The inner ring performs a Byzantine agreement, applying the update
  • 3. The results are sent down the dissemination tree and into the archive
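The three steps above can be sketched as a toy message flow. The class and method names (`InnerRing`, `Replica`, `submit`) are illustrative, and both the Byzantine agreement and the threshold signature are reduced to simple stand-ins:

```python
# Toy sketch of the three-step write path; not the real OceanStore API.
import hashlib

class Replica:
    """A second-tier replica that caches certified results."""
    def __init__(self):
        self.cache = {}

    def receive(self, update, cert):
        self.cache[cert] = update

class InnerRing:
    """Stands in for the 3f+1 servers; agreement is modeled as an append."""
    def __init__(self):
        self.log = []            # serialized, committed updates
        self.subscribers = []    # roots of the dissemination tree

    def submit(self, update: bytes) -> str:
        # Step 2: agreement + serialization (modeled as a log append).
        self.log.append(update)
        # A SHA-1 digest stands in for the threshold signature certificate.
        cert = hashlib.sha1(update).hexdigest()
        # Step 3: push the certified result down the dissemination tree.
        for replica in self.subscribers:
            replica.receive(update, cert)
        return cert

ring = InnerRing()
r = Replica()
ring.subscribers.append(r)
cert = ring.submit(b"write: foo -> bar")   # Step 1: client sends the update
assert r.cache[cert] == b"write: foo -> bar"
```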

SLIDE 14

Write Path Details

  • Inner Ring uses Byzantine agreement for fault tolerance
    – Up to f of 3f + 1 servers can fail
    – We use a modified version of the Castro-Liskov protocol

  • Inner Ring certifies decisions with proactive threshold signatures
    – Single public (verification) key
    – Each member has a key share which lets it generate signature shares
    – Need f + 1 signature shares to generate a full signature
    – Independent sets of key shares can be used to control membership

  • Second Tier and Archive are ignorant of the composition of the Inner Ring
    – Know only the single public key
    – Allows simple replacement of faulty Inner Ring servers
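The f + 1-of-n idea behind threshold signing can be illustrated with Shamir secret sharing over a prime field. OceanStore's actual scheme is a proactive threshold RSA signature, so this sketch only shows why f + 1 shares suffice to reconstruct while f or fewer do not:

```python
# Illustrative f+1-of-n sharing (Shamir), NOT the real threshold RSA scheme.
import random

P = 2**127 - 1  # a Mersenne prime used as the field modulus

def make_shares(secret, f, n):
    # Random degree-f polynomial with constant term = secret.
    coeffs = [secret] + [random.randrange(P) for _ in range(f)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def combine(shares):
    # Lagrange interpolation at x = 0 recovers the constant term.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

f = 2
shares = make_shares(123456789, f, 3 * f + 1)   # n = 3f + 1 = 7 servers
assert combine(shares[: f + 1]) == 123456789     # any f + 1 shares reconstruct
```

A degree-f polynomial is determined by f + 1 points, so f + 1 shares pin down the secret exactly while f shares leave it information-theoretically hidden.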

SLIDE 15

Micro Benchmarks: Update Latency vs. Update Size

[Figure: update latency (ms) vs. update size (kB) for 512-bit and 1024-bit keys; both curves have slope 0.6 s/MB]

  • Use two key sizes to show effects of Moore’s Law on latency

    – 512 bit keys are not secure, but are 4× faster
    – Gives an upper bound on latency three years from now
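The linear trend above reads as a simple cost model: a fixed per-update cost plus 0.6 ms per kB of data (0.6 s/MB). The 90 ms base used below is a hypothetical placeholder, not a measured number from the talk:

```python
# Simple linear latency model implied by the slope above.
SLOPE_MS_PER_KB = 0.6          # 0.6 s/MB == 0.6 ms/kB

def predicted_latency_ms(update_kb, base_ms):
    # base_ms: fixed per-update cost (agreement + signing), size-independent
    return base_ms + SLOPE_MS_PER_KB * update_kb

# With an assumed 90 ms fixed cost, a 32 kB update would take ~109 ms.
assert predicted_latency_ms(32, 90) == 90 + 0.6 * 32
```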

SLIDE 18

Micro Benchmarks: Update Latency Remarks

  • Threshold signatures are expensive
    – Takes 6.3 ms to generate a regular 1024 bit signature
    – But takes 73.9 ms to generate a 1024 bit threshold signature share
    – (Combining shares takes less than 1 ms)

  • Unfortunately, this is a mathematical fact of life
    – Cannot use the Chinese Remainder Theorem in computing shares (a 4× slowdown)
    – Making individual shares verifiable is expensive

  • Almost no research into the performance of threshold cryptography

SLIDE 20

Micro Benchmarks: Throughput vs. Update Size

[Figure: update operations per second and total bandwidth (MB/s) vs. update size (kB)]

  • Using 1024 bit keys, 60 synchronous clients
  • Max throughput is a respectable 5 MB/s
    – Berkeley DB through Java can only do about 7.5 MB/s
  • But we have a problem with small updates
    – 13 ops/s is atrocious!
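The small-update ceiling follows directly from the signature cost reported earlier: at 73.9 ms per 1024-bit threshold signature share, a server can produce at most about 13.5 shares per second, which matches the measured small-update throughput.

```python
# Back-of-the-envelope check: small-update throughput is signature-bound.
share_ms = 73.9                 # cost of one threshold signature share
ceiling_ops = 1000 / share_ms   # at most this many signed updates per second
assert 13 < ceiling_ops < 14    # ~13.5 ops/s, matching the measurement
```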

SLIDE 25

Batching: A Solution to the Small Update Problem

  • What if we could combine many small updates into a single batch?

  • Each Inner Ring member
    – Decides the result of each update individually
    – Generates a signature share over the results of all of the updates

  • Saves CPU time
    – Generating signature shares is expensive

  • Saves network bandwidth
    – Each Byzantine agreement requires O(ringsize²) messages

  • But makes signatures unwieldy
    – Each signature is now O(batchsize) long
    – For high throughput, we want batch sizes in the hundreds or thousands
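The amortization argument can be made concrete with a toy cost model. The 1 ms per-update processing cost is an assumed placeholder; the per-batch cost is taken to be roughly one signature share (73.9 ms, from the earlier slide):

```python
# Toy amortization model: one expensive agreement+signature per batch.
BATCH_COST_MS = 74.0      # ~one threshold signature share per batch
PER_UPDATE_MS = 1.0       # assumed cheap per-update work (placeholder)

def ops_per_second(batch_size):
    batch_ms = BATCH_COST_MS + PER_UPDATE_MS * batch_size
    return 1000.0 * batch_size / batch_ms

assert ops_per_second(1) < 14       # unbatched: signature-bound, ~13 ops/s
assert ops_per_second(100) > 500    # batched: signature cost amortized away
```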

SLIDE 29

Merkle Trees: Making Batching Efficient

[Figure: a Merkle tree over a batch of 15 results. Key: H_i = SHA1(H_2i, H_2i+1); sign only (n = 15, H_1). To verify Result 2, only the hashes on its path to the root are needed]

  • Build a Merkle Tree over results
    – Each node is a hash of its two children

  • Sign only the tree size and the top hash
    – To verify Result 2, need only the signature plus the hashes on its path
    – Signature over any one result is only O(log batchsize)
    – Provably secure
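A minimal sketch of this construction, using SHA-1 as in the slide; the function names (`build_tree`, `proof`, `verify`) are illustrative, not the real OceanStore code:

```python
# Merkle-tree batching sketch: hash each result, build the tree bottom-up,
# sign only the root, verify any result with an O(log batchsize) path.
import hashlib

def sha1(*parts):
    h = hashlib.sha1()
    for p in parts:
        h.update(p)
    return h.digest()

def build_tree(results):
    level = [sha1(r) for r in results]          # leaf hashes
    tree = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]          # pad odd levels
        level = [sha1(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                                  # tree[-1][0] is the root

def proof(tree, index):
    path = []
    for level in tree[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sib = index ^ 1
        path.append((level[sib], sib < index))   # (sibling hash, sibling-on-left?)
        index //= 2
    return path

def verify(result, path, root):
    h = sha1(result)
    for sib, sib_left in path:
        h = sha1(sib, h) if sib_left else sha1(h, sib)
    return h == root

results = [b"result %d" % i for i in range(15)]  # a batch of 15 results
tree = build_tree(results)
root = tree[-1][0]                               # only (15, root) gets signed
p = proof(tree, 2)
assert verify(results[2], p, root)
assert len(p) == 4                               # O(log 15) hashes in the path
```

Only the root (plus the batch size) is covered by the expensive threshold signature; each client checks its own result against the root with a handful of cheap hashes.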

SLIDE 32

Micro Benchmarks: Throughput vs. Update Size (w/ Batching)

[Figure: update operations per second and total bandwidth (MB/s) vs. update size (kB), with and without naive batching]

  • Batching works great
    – Amortizes expensive agreements over many updates
    – For small updates, go from 13.5 ops/s to 76 ops/s
    – Introspecting on batch size should further improve small-update throughput

SLIDE 33

Macro Benchmarks: The Andrew Benchmark

[Figure: Andrew benchmark architecture — applications call fopen/fread/fwrite; the Linux kernel translates these into NFS requests (READ, WRITE, GETATTR) to a user-level NFS client, which issues OceanStore operations (OSCreate, OSUpdate, OSRead) to a replica over Tapestry]

  • Built a UNIX file system on top of OceanStore
    – Runs as a user-level NFS daemon on Linux
    – Applications use the familiar fopen, fwrite, etc. No recompilation.
    – Kernel translates to NFS requests and sends them to the local daemon
    – Daemon translates to OceanStore requests and sends them out on the network

SLIDE 35

Macro Benchmarks: The Andrew Benchmark

Inter-host ping times in milliseconds (standard deviations in parentheses):

  Source \ Dest.   U. TX          GA Tech        Rice          UCB
  UW               45.3 (0.75)    56.5 (0.14)    49.6 (3.1)    20.0 (0.11)
  U. TX            –              24.1 (0.49)    8.45 (1.5)    61.7 (0.22)
  GA Tech          –              –              27.7 (2.2)    59.0 (0.20)
  Rice             –              –              –             61.5 (0.69)

  • For more realism, we used a nationwide network

– Find out whether Byzantine agreement is practical in wide area

  • Ran the Andrew Benchmark

– Simulates software development workload

  • For control, used several competitors

    – Linux user-level NFS daemon: real NFS, ships with Debian GNU/Linux
    – Java-based user-level NFS daemon: uses disk (not OceanStore)

SLIDE 37

Macro Benchmarks: Local Andrew

[Figure: local-area Andrew benchmark time (s), broken into Phase 1 (create directories), Phase 2 (copy source tree), Phase 3 (stat all files), Phase 4 (read all files), and Phase 5 (compile source tree), for Linux NFS, the Java NFS daemon, and OceanStore in 512 Simple, 512 Batching + Tentative, and 1024 Simple configurations]

  • Simple OceanStore performance not so hot

– In the local area, NFS is in its element; OceanStore isn’t

  • But with tentative update support and batching, OceanStore pretty good

    – Tentative updates let the client go on while waiting for agreements
    – Batching allows the inner ring to keep up
    – Within a factor of two of the Java-based NFS daemon
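The tentative-update idea can be sketched as optimistic client state: the client applies its own writes locally and keeps working, then confirms or discards them when the inner ring's certified decision arrives. `TentativeClient` and its methods are illustrative, not the real OceanStore client API:

```python
# Optimistic client-side sketch of tentative updates.
class TentativeClient:
    def __init__(self):
        self.committed = {}
        self.tentative = []          # (key, value) writes awaiting agreement

    def write(self, key, value):
        # Apply locally without blocking on the inner ring.
        self.tentative.append((key, value))

    def read(self, key):
        # Reads see tentative writes first, then committed state.
        for k, v in reversed(self.tentative):
            if k == key:
                return v
        return self.committed.get(key)

    def on_decision(self, accepted: bool):
        # Called when the signed result comes down the dissemination tree.
        if accepted:
            for k, v in self.tentative:
                self.committed[k] = v
        self.tentative.clear()       # rejected writes simply vanish

c = TentativeClient()
c.write("a", 1)
assert c.read("a") == 1              # visible before the ring has committed
c.on_decision(accepted=True)
assert c.committed["a"] == 1
```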

SLIDE 38

Macro Benchmarks: Nationwide Andrew

[Figure: nationwide Andrew benchmark time (s) by phase for Linux NFS and OceanStore in 512 Simple and 1024 Simple configurations]

  • In the wide area, OceanStore is in its element; NFS isn’t

    – Even simple OceanStore is nearly within a factor of two
    – Numbers with batching and tentative updates forthcoming
    – Should outperform NFS

SLIDE 43

Conclusion

  • All the basics of the OceanStore write path implemented and working
    – Not doing full recovery yet

  • Performance is good
    – Single update time under 100 ms, improves directly with Moore’s Law
    – Throughput great for large updates
    – Batching allows inner ring to amortize signatures over many updates
      ∗ Get large-update throughput with small updates
      ∗ Secure and space-efficient

  • Provides a lot more functionality than the competition
    – Higher durability and availability than NFS
    – Cryptographic data integrity
    – Versioning allows logical “undo”
