CS 555: DISTRIBUTED SYSTEMS

[DYNAMO & GOOGLE FILE SYSTEM]

Shrideep Pallickara
Computer Science
Colorado State University

November 14, 2019

Frequently asked questions from the previous class survey

  • Are there entries for each virtual node (and the range it manages) in the zero-hop DHT?

Topics covered in this lecture

  • Dynamo
      • Quorums and consistency
  • Google File System

DYNAMO: QUORUMS AND CONSISTENCY


A client must specify which version it is updating

  • Pass context from an earlier read operation
      • The context contains vector clock information
  • Requests with branches that cannot be reconciled?
      • Dynamo returns all conflicting objects, with versioning info in the context
      • An update done using this context reconciles and collapses all branches
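The reconcile-and-collapse step can be sketched with plain vector clock operations. This is a minimal illustrative sketch (the names are not Dynamo's actual API): a sibling version is kept only if no other sibling's clock dominates it, and the update's context merges the surviving clocks.

```python
def dominates(vc_a, vc_b):
    """True if vector clock vc_a descends from (or equals) vc_b."""
    return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

def reconcile(siblings):
    """Collapse sibling versions: drop dominated ones, merge the clocks of the rest."""
    survivors = [
        (vc, val) for vc, val in siblings
        if not any(other is not vc and dominates(other, vc) for other, _ in siblings)
    ]
    merged = {}
    for vc, _ in survivors:
        for node, count in vc.items():
            merged[node] = max(merged.get(node, 0), count)
    return merged, [val for _, val in survivors]

# Two concurrent updates to a shopping cart: neither clock dominates,
# so both survive and the merged clock becomes the new context.
v1 = ({"A": 2, "B": 1}, "cart-v1")
v2 = ({"A": 1, "B": 2}, "cart-v2")
clock, values = reconcile([v1, v2])
```

A subsequent write carrying `clock` as its context then collapses both branches into one version.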


Execution of get() and put() operations

  • Read and write operations involve the first N healthy nodes
  • During failures, nodes lower in priority are accessed


To maintain consistency, Dynamo uses a quorum protocol

  • Uses configurable settings for the replicas that must participate in
      • Reads
      • Writes


Quorum-based protocols: When there are N replicas

  • Read quorum: NR
  • Write quorum (to modify a file): NW
  • NR + NW > N
      • Prevents read-write conflicts
  • NW > N/2
      • Prevents write-write conflicts
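The two constraints can be checked mechanically. A small illustrative helper (not from the slides):

```python
def valid_quorum(n, n_r, n_w):
    """Check Dynamo-style quorum settings against both conflict rules."""
    no_read_write_conflict = n_r + n_w > n   # every read quorum overlaps every write quorum
    no_write_write_conflict = n_w > n / 2    # any two write quorums share a replica
    return no_read_write_conflict and no_write_write_conflict

assert valid_quorum(n=3, n_r=2, n_w=2)       # the common (N=3, NR=2, NW=2) configuration
assert not valid_quorum(n=12, n_r=7, n_w=6)  # NW=6 is not > 12/2: write-write conflicts possible
```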


Quorum-based protocols: Example

Replicas: A B C D E F G H I J K L (N=12)

  • NR=3, NW=10: every read quorum of 3 overlaps every write quorum of 10
  • NR=7, NW=6: write-write conflict
      • Concurrent writes to {A, B, C, E, F, G} and {D, H, I, J, K, L} will both be accepted, since NW=6 is not greater than N/2


Upon receiving a put() request for a key

  • The coordinator generates a vector clock for the new version
      • Sends the new version to the N highest-ranked reachable nodes
      • If at least NW-1 nodes respond: the write is successful!
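A sketch of this put() path (the function names and the transport are hypothetical stand-ins; real Dynamo also handles hinted handoff, timeouts, and so on):

```python
def coordinator_put(key, value, context_clock, coordinator_id, other_replicas,
                    n_w, send_to_replica):
    """Write a new version; succeed once NW-1 of the other replicas acknowledge."""
    # New vector clock: bump the coordinator's own counter in the read context.
    new_clock = dict(context_clock)
    new_clock[coordinator_id] = new_clock.get(coordinator_id, 0) + 1

    # Forward the new version to the highest-ranked reachable nodes and
    # count acknowledgements (the coordinator's local write is the NW-th vote).
    acks = sum(1 for node in other_replicas
               if send_to_replica(node, key, value, new_clock))
    return acks >= n_w - 1, new_clock
```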


External Discovery: During node adds

  • When A and B join, it might be a while before they know of each other's existence
      • Logical partitioning
  • Use seed nodes that are known to all nodes
      • All nodes reconcile membership with a seed

DYNAMO: EXPERIENCES


Popular reconciliation strategies

  • Business-logic specific
  • Timestamp based
      • Last write wins
  • High-performance read engine
      • High read rates, small update rates
      • NR=1 and NW=N


Quorum-based protocols: Example 2

Replicas: A B C D E F G H I J K L (N=12)

  • NR=1, NW=12: a read can be served by any single replica, because every write must reach all 12 replicas


Common configuration of the quorum

  • NR=2
  • NW=2
  • N=3


Balancing performance and durability

  • Some services are not happy with the 300 ms SLA
      • Writes tend to be slower than reads
  • To cope with this, nodes maintain an object buffer
      • In main memory
      • Periodically written to storage


THE GOOGLE FILE SYSTEM

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System. Proceedings of SOSP 2003: 29-43.


Broad brushstroke themes in current extreme scale storage systems

  • Voluminous data
  • Commodity hardware
  • Distributed data
  • Expect failures
  • Tune for access by applications
  • Optimize for dominant usage
  • Tradeoff between consistency and availability


Demand pulls in GFS: I

  • Component failures are the norm
  • Files are huge by traditional standards
  • File mutations happen predominantly through appends
      • Not overwrites
  • Applications and the file system API are designed in lock-step


Demand pulls in GFS: II

  • Hundreds of producers will concurrently append to a file
      • Many-way merging
  • High sustained bandwidth is more important than low latency


The file system interface

  • Does not implement standard APIs such as POSIX
  • Supports create, delete, open, close, read, and write
  • snapshot
      • Creates a fast copy of a file or directory tree
  • record append
      • Multiple writers can concurrently append records to the same file
      • Without additional locking


Architecture of GFS

[Figure: a single GFS master; multiple GFS chunk servers, each storing chunks via the local Linux file system; and multiple clients]


In GFS a file is broken up into fixed-size chunks

  • Obvious reason
      • The file is too big
  • Sets the stage for computations that operate on this data (e.g., MapReduce)
      • Parallel I/O
      • I/O seek times are about 14 × 10^6 times slower than CPU access times


In GFS a file is broken up into fixed-size chunks

  • Each chunk has a 64-bit globally unique ID
      • Assigned by the master
  • Chunks are stored by chunk servers
      • On local disks, as Linux files
  • Each chunk is replicated
      • The default is 3 replicas
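Because chunks are fixed-size, a client can locally compute which chunk a byte offset falls in, and only needs the master to map (file name, chunk index) to a chunk handle. A small illustrative sketch:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB, fixed in GFS

def chunk_index(byte_offset):
    """Index of the chunk (within the file) containing this byte offset."""
    return byte_offset // CHUNK_SIZE

def offset_in_chunk(byte_offset):
    """Position of the byte within that chunk."""
    return byte_offset % CHUNK_SIZE

# A read at byte 200,000,000 of a file lands in the third chunk (index 2).
assert chunk_index(200_000_000) == 2
```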


Master operations

  • Manage system metadata
  • Leasing of chunks
  • Garbage collection of orphaned chunks
  • Chunk migrations


ALL system metadata is managed by the Master and stored in Main Memory

  ① File and chunk namespaces
  ② Mapping from files to chunks
  ③ Locations of chunks

  • Mutations are logged into a permanent log


Why have a single Master?

  • Vastly simplifies the design
  • Easy to use global knowledge to reason about
      • Chunk placements
      • Replication decisions


Communications with the chunk servers

  • Periodic communications using heartbeats
      • Deliver instructions to the chunk server
      • Collect/retrieve state from the chunk server


Chunk size

  • Fixed at 64 MB
      • Much larger than typical file system block sizes (512 bytes)
  • Lazy space allocation
      • Each chunk is stored as a plain Linux file
      • Extended only as needed


But why this big?

  • Reduces client interaction with the master
      • A client can cache chunk information for a multi-TB working set
  • Reduces network overhead
      • With a large chunk, a client performs more operations on it
      • Persistent connections
  • Reduces the size of the metadata stored in the master
      • 64 bytes of metadata per 64 MB chunk


Why keep the entire metadata in memory?

  • Speed
  • The master can scan its state in the background
      • Implement chunk garbage collection
      • Re-replicate if there are failures
      • Migrate chunks to balance load and space
  • Add extra memory to increase the file system size


Size of the file system with 1 TB of RAM (assume file sizes are exact multiples of the chunk size)

  • Number of chunk entries = 2^40 / 2^6 = 2^34
      • 1 TB of RAM; 64 (= 2^6) bytes of metadata per chunk
  • MAXIMUM SIZE of the file system
      = number of entries × chunk size
      = (2^40 / 2^6) × (2^6 × 2^20)
      = 2^60 bytes = 1 EB
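The same arithmetic, checked numerically:

```python
ram_bytes = 2**40           # 1 TB of RAM on the master
metadata_per_chunk = 2**6   # 64 bytes of metadata per chunk
chunk_size = 2**26          # 64 MB per chunk (2^6 * 2^20 bytes)

num_entries = ram_bytes // metadata_per_chunk
max_fs_size = num_entries * chunk_size

assert num_entries == 2**34
assert max_fs_size == 2**60  # 1 EB
```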


Tracking the chunk servers

  • The master does not keep a persistent record of which chunk servers hold which chunks
  • The list is maintained via heartbeats
      • Allows the list to stay in sync with reality despite failures
      • The chunk server has the final word on the chunks it holds


Caching at the client/chunk servers

  • Clients do not cache file data
      • At the client, the working set may be too large
      • Simplifies the client; eliminates cache-coherence problems
  • Chunk servers do not cache file data either
      • Chunks are stored as local files
      • Linux's buffer cache already keeps frequently accessed data in memory


MANAGING MUTATIONS

Handling writes and appends to a file


Mutations

  • A mutation changes the content or metadata of a chunk
      • Write
      • Append
  • Each mutation is performed at all of a chunk's replicas


GFS uses leases to maintain a consistent mutation order across replicas

  • The master grants a lease to one of the replicas
      • The PRIMARY
  • The primary picks a serial order
      • For all mutations to the chunk
      • The other replicas follow this order when applying mutations


Lease mechanism designed to minimize communications with the master

  • A lease has an initial timeout of 60 seconds
  • As long as the chunk is being mutated
      • The primary can request and receive extensions
  • Extension requests and grants are piggybacked on heartbeat messages


Revocation and transfer of leases

  • The master may revoke a lease before it expires
  • If communications with the primary are lost
      • The master can safely give the lease to another replica
      • ONLY AFTER the lease period for the old primary elapses
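The timeout, extension, and safe re-grant rules can be sketched as simple bookkeeping (illustrative only; real GFS piggybacks extensions on heartbeats and also handles explicit revocation):

```python
LEASE_SECONDS = 60  # initial lease timeout

class Lease:
    def __init__(self, primary, now):
        self.primary = primary
        self.expires = now + LEASE_SECONDS

    def extend(self, now):
        """Extension granted while the chunk is still being mutated."""
        self.expires = now + LEASE_SECONDS

    def can_regrant(self, now):
        """The master may hand the lease to another replica only after expiry."""
        return now >= self.expires

lease = Lease("replica-A", 0.0)
lease.extend(30.0)                 # heartbeat extension at t=30 pushes expiry to t=90
assert not lease.can_regrant(60.0) # still held
assert lease.can_regrant(95.0)     # old lease elapsed: safe to re-grant
```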


How a write is actually performed

[Figure: write flow among the client, the master, the primary replica, and secondary replicas A and B]


Client pushes data to all the replicas (I)

  • Each chunk server stores the data in an internal LRU buffer until
      • The data is used
      • Or it ages out


Client pushes data to all the replicas (II)

  • When the chunk servers acknowledge receipt of the data
      • The client sends a write request to the primary
  • The primary assigns consecutive serial numbers to mutations
      • And forwards them to the replicas


Data flow is decoupled from the control flow to utilize the network efficiently

  • Utilize each machine's network bandwidth
  • Avoid network bottlenecks
  • Avoid high-latency links
  • Leverage network topology
      • Estimate distances from IP addresses


What if the secondary replicas could not finish the write operation?

  • The client request is considered failed
  • The modified region is inconsistent
      • No attempt is made to delete it from the chunk
      • The client must handle this inconsistency
  • The client retries the failed mutation


GFS client code implements the file system API

  • Communications with the master and chunk servers are done transparently
      • On behalf of apps that read or write data
  • Interacts with the master for metadata
  • Data-bearing communications go directly to the chunk servers


Traditional writes

  • The client specifies the offset at which the data needs to be written
  • Concurrent writes to the same region
      • Are not serializable
      • The region ends up containing data fragments from multiple clients


Atomic record appends

  • The client specifies only the data, not the offset
  • GFS appends it to the file
      • At least once, atomically
      • At an offset of GFS's choosing
  • No need for a distributed lock manager


The control flow for record appends is similar to that of writes

  • The client pushes the data to the replicas of the last chunk of the file
  • The primary replica checks whether the record fits in this chunk


Primary replica checks if the record append will breach the size (64 MB) threshold

  • If the chunk size would be breached
      • Pad the chunk to its maximum size
      • Tell the client that the operation should be retried on the next chunk
  • If the record fits, the primary
      • Appends the data to its replica
      • Notifies the secondaries to write at the exact same offset


Record sizes and fragmentation

  • Record size is restricted to ¼ of the chunk size
  • Minimizes worst-case fragmentation
      • Internal fragmentation within each chunk
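The primary's fit-or-pad decision, together with the ¼-chunk record cap, can be sketched as follows (hypothetical helper; sizes in bytes):

```python
CHUNK_SIZE = 64 * 2**20        # 64 MB chunks
MAX_RECORD = CHUNK_SIZE // 4   # records are capped at 1/4 of the chunk size

def try_record_append(chunk_used, record_len):
    """Return (action, offset): append in this chunk, or pad it and retry on the next."""
    if record_len > MAX_RECORD:
        raise ValueError("record exceeds the 1/4-chunk cap")
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the current chunk to its maximum size; the client retries on the next chunk.
        return ("pad_and_retry", None)
    # The record fits: the primary appends at this offset and tells the
    # secondaries to write at exactly the same offset.
    return ("append", chunk_used)

assert try_record_append(0, 1024) == ("append", 0)
assert try_record_append(CHUNK_SIZE - 100, 1024) == ("pad_and_retry", None)
```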


What if a record append fails at one of the replicas?

  • The client must retry the operation
  • Replicas of the same chunk may contain
      • Different data
      • Duplicates of the same record, in whole or in part
  • Replicas of chunks are not bit-wise identical!
      • In most systems, replicas are identical
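Because a failed-and-retried append can leave duplicate copies of a record on some replicas, GFS applications cope with at-least-once semantics on the read path; the GFS paper notes that writers can tag records with unique identifiers so readers can drop repeats. An illustrative reader-side filter:

```python
def dedup_records(records):
    """Drop duplicate records by their writer-assigned unique ID, keeping the first copy."""
    seen = set()
    unique = []
    for record_id, payload in records:
        if record_id in seen:
            continue  # a retried append left this record more than once
        seen.add(record_id)
        unique.append((record_id, payload))
    return unique

# A retried append left record 7 on this replica twice:
log = [(7, "a"), (8, "b"), (7, "a"), (9, "c")]
assert dedup_records(log) == [(7, "a"), (8, "b"), (9, "c")]
```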


GFS only guarantees that the data will be written at least once as an atomic unit

  • For an operation to return success
      • The data must be written at the same offset on all the replicas
  • After the write, all replicas are at least as long as the end of the record
      • Any future record will be assigned a higher offset or a different chunk


The contents of this slide-set are based on the following references

  • Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels: Dynamo: Amazon's Highly Available Key-value Store. Proceedings of SOSP 2007: 205-220.
  • Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System. Proceedings of SOSP 2003: 29-43.