Università degli Studi di Roma Tor Vergata, Dipartimento di Ingegneria Civile e Ingegneria Informatica. Distributed and Cloud Storage Systems. Corso di Sistemi Distribuiti e Cloud Computing, A.A. 2017/18. Valeria Cardellini
Why scale the storage?
Storage capacities and data transfer rates have increased massively over the years. Let's consider the time needed to transfer data*
1
HDD: size ~1 TB, speed 250 MB/s — SSD: size ~1 TB, speed 850 MB/s

Data size | HDD          | SSD
10 GB     | 40 s         | 12 s
100 GB    | 6 m 49 s     | 2 m
1 TB      | 1 h 9 m 54 s | 20 m 33 s
10 TB     | ?            | ?
* we consider no overhead
We need to scale out!
Valeria Cardellini - SDCC 2017/18
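As a sanity check on the table above, these times are just size divided by sustained throughput. A minimal sketch in Python (assuming binary units and no transfer overhead; small rounding differences from the table are expected):

    # Back-of-envelope transfer times: time = data size / sustained throughput.
    # Assumes binary units (1 GB = 1024 MB) and no protocol or seek overhead.

    def transfer_time(size_gb, throughput_mb_s):
        seconds = size_gb * 1024 / throughput_mb_s
        h, rem = divmod(int(round(seconds)), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else (f"{m}m {s}s" if m else f"{s}s")

    for size_gb in (10, 100, 1024, 10 * 1024):    # 10 GB, 100 GB, 1 TB, 10 TB
        hdd = transfer_time(size_gb, 250)         # HDD: ~250 MB/s
        ssd = transfer_time(size_gb, 850)         # SSD: ~850 MB/s
        print(f"{size_gb:>6} GB  HDD: {hdd:>12}  SSD: {ssd:>10}")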
General principles for scalable data storage
- Scalability and high performance
– To cope with the continuous growth of data to store – Use multiple storage nodes
- Ability to run on commodity hardware
– Hardware failures are the norm rather than the exception
- Reliability and fault tolerance
– Transparent data replication
- Availability
– Data should be available when needed – CAP theorem: trade-off with consistency
2 Valeria Cardellini - SDCC 2017/18
Solutions for scalable data storage
Various forms of scalable data storage:
- Distributed file systems
– Manage (large) files on multiple nodes – Examples: Google File System, Hadoop Distributed File System
- NoSQL databases (more generally, NoSQL data stores)
– Simple and flexible non-relational data models – Horizontal scalability and fault tolerance – Key-value, column family, document, and graph stores – Examples: BigTable, Cassandra, MongoDB, HBase, DynamoDB – Several time series databases are built on top of NoSQL databases (examples: InfluxDB, KairosDB)
- NewSQL databases
– Add horizontal scalability and fault tolerance to the relational model – Examples: VoltDB, Google Spanner
3 Valeria Cardellini - SDCC 2017/18
Scalable data storage solutions
4
The whole picture of the different solutions
Valeria Cardellini - SDCC 2017/18
Data storage in the Cloud
- Main goals:
– Massive scaling “on demand” (elasticity) – Data availability – Simplified application development and deployment
- Some storage systems offered only as Cloud services
– Either directly (e.g., Amazon DynamoDB, Google Bigtable, Google Cloud Storage) or as part of a programming environment
- Other proprietary systems used only internally (e.g.,
Dynamo, GFS)
5 Valeria Cardellini - SDCC 2017/18
Distributed file systems
- Represent the primary support for data management
- Manage data storage across a network of machines
- Provide an interface through which to store information in the form of files and later access it for read and write operations
– Using the traditional file system interface
- Several solutions with different design choices
– GFS, Apache HDFS (GFS open-source clone): designed for batch applications with large files – Alluxio: in-memory (high-throughput) storage system – Lustre, Ceph: designed for high performance
6 Valeria Cardellini - SDCC 2017/18
Where to store data?
- Memory I/O vs. disk I/O
- See “Latency numbers every programmer should know”
http://bit.ly/2pZXIU9
7 Valeria Cardellini - SDCC 2017/18
Case study: Google File System
- Distributed fault-tolerant file system implemented in
user space
- Manages (very) large files: usually multi-GB
- Divide et impera: file divided into fixed-size chunks
- Chunks:
– Have a fixed size – Transparent to users – Each chunk is stored as a plain file
- Files follow the write-once, read-many-times pattern
– Efficient append operation: appends data at the end of a file atomically at least once, even in the presence of concurrent operations (minimal synchronization overhead)
- Fault tolerance, high availability through chunk
replication
8
- S. Ghemawat, H. Gobioff, S.-T. Leung, "The Google File System”, ACM SOSP ‘03.
Valeria Cardellini - SDCC 2017/18
GFS operation environment
Valeria Cardellini - SDCC 2017/18 9
GFS: architecture
- Master
– Single, centralized entity (this simplifies the design) – Manages file metadata (stored in memory)
- Metadata: access control information, mapping from files to
chunks, chunk locations
– Does not store data (i.e., chunks) – Manages chunks: creation, replication, load balancing, deletion
10 Valeria Cardellini - SDCC 2017/18
GFS: architecture
- Chunkservers (100s – 1000s)
– Store chunks as files – Spread across cluster racks
- Clients
– Issue control (metadata) requests to GFS master – Issue data requests directly to GFS chunkservers – Cache metadata but not data (simplifies the design)
11 Valeria Cardellini - SDCC 2017/18
GFS: metadata
- The master stores three major types of metadata:
– File and chunk namespace (directory hierarchy) – Mapping from files to chunks – Current locations of chunks
- Metadata are stored in memory (64B per chunk)
– Pro: fast; easy and efficient to scan the entire state – Con: the number of chunks is limited by the amount of memory of the master:
"The cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility gained"
- The master also keeps an operation log with a
historical record of metadata changes
– Persistent on local disk – Replicated – Checkpoint for fast recovery
12 Valeria Cardellini - SDCC 2017/18
GFS: chunk size
- Chunk size is either 64 MB or 128 MB
– Much larger than typical block sizes
- Why? Large chunk size reduces:
– Number of interactions between client and master – Size of metadata stored on master – Network overhead (persistent TCP connection to the chunk server over an extended period of time)
- Potential disadvantage
– Chunks for small files may become hot spots
13 Valeria Cardellini - SDCC 2017/18
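A back-of-envelope estimate (not from the slides) of why large chunks keep the master's in-memory metadata manageable, using the 64 MB chunk size and the ~64 B of metadata per chunk mentioned above:

    # Rough estimate of GFS master metadata for a given amount of stored data.
    # Assumptions: 64 MB chunks and ~64 B of in-memory metadata per chunk
    # (both figures from the GFS design); replication is ignored here.

    CHUNK_SIZE = 64 * 2**20          # 64 MB
    METADATA_PER_CHUNK = 64          # ~64 bytes kept in master memory per chunk

    def master_metadata(stored_bytes, chunk_size=CHUNK_SIZE):
        chunks = -(-stored_bytes // chunk_size)        # ceiling division
        return chunks, chunks * METADATA_PER_CHUNK

    for data in (2**40, 2**50):                        # 1 TB and 1 PB of file data
        chunks, meta = master_metadata(data)
        print(f"{data / 2**40:>6.0f} TB -> {chunks:>10} chunks, "
              f"{meta / 2**20:.1f} MB of metadata")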
GFS: fault-tolerance and replication
- The master replicates (and maintains the replication) of
each chunk on several chunkservers
– At least 3 replicas on different chunkservers – Replication based on a primary-backup scheme – Replication degree > 3 for highly requested chunks
- Multi-level placement of replicas
– Different machines, local rack: improves reliability and availability – Different machines, different racks: also improves aggregate bandwidth
- Data integrity
– Chunk divided into 64 KB blocks; 32-bit checksum for each block – Checksums kept in memory – Checksum is verified every time a client reads data
14 Valeria Cardellini - SDCC 2017/18
GFS: master operations
- Stores metadata
- Manages and locks namespace
– Namespace represented as a lookup table
- Periodic communication with each chunkserver
– Sends instructions and collects chunkserver state (heartbeat messages)
- Creates, re-replicates, rebalances chunks
– Balances disk space utilization and load across chunkservers – Distributes replicas among racks to increase fault tolerance – Re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal
15 Valeria Cardellini - SDCC 2017/18
GFS: master operations (2)
- Garbage collection
– File deletion logged by the master – File renamed to a hidden name with deletion timestamp: its deletion is postponed – Deleted files can be easily recovered in a limited timespan
- Stale replica detection
– Chunk replicas may become stale if a chunkserver fails or misses updates to the chunk – For each chunk, the master keeps a chunk version number – Chunk version number updated for each chunk mutation – The master removes stale replicas in its regular garbage collection
16 Valeria Cardellini - SDCC 2017/18
GFS: system interactions
- Files are hierarchically organized in directories
– There is no data structure that represents a directory
- A file is identified by its pathname
– GFS does not support aliases
- GFS supports traditional file system operations (but no POSIX API)
– create, delete, open, close, read, write
- Also supports two special operations:
– snapshot: makes a copy of a file or a directory tree almost instantaneously (based on copy-on-write techniques) – record append: atomically appends data to a file; multiple clients can append to the same file concurrently without fear of overwriting one another's data
17 Valeria Cardellini - SDCC 2017/18
GFS: system interactions
18
- Read operation
- Data flow is decoupled from control flow
(1) Client sends the master: read(file name, chunk index)
(2) Master replies: chunk ID, chunk version number, locations of replicas
(3) Client sends the "closest" chunkserver holding a replica: read(chunk ID, byte range)
(4) Chunkserver replies with the data (see the code sketch below)
Valeria Cardellini - SDCC 2017/18
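A minimal sketch of the read protocol above; the Master and Chunkserver classes are simplified in-memory stand-ins for the real RPC interfaces, not Google's actual API:

    # Conceptual sketch of the GFS read path: metadata from the master,
    # data directly from a chunkserver.
    from collections import namedtuple

    CHUNK_SIZE = 64 * 2**20  # 64 MB chunks
    Replica = namedtuple("Replica", "chunkserver network_distance")

    class Chunkserver:
        def __init__(self):
            self.chunks = {}                      # chunk_id -> bytes
        def read(self, chunk_id, byte_range):
            start, end = byte_range
            return self.chunks[chunk_id][start:end]

    class Master:
        def __init__(self):
            self.files = {}                       # (filename, chunk_index) -> (chunk_id, version, [Replica])
        def lookup(self, filename, chunk_index):  # metadata-only request
            return self.files[(filename, chunk_index)]

    def gfs_read(master, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE                                   # (1) client computes chunk index
        chunk_id, version, replicas = master.lookup(filename, chunk_index)   # (2) master replies with metadata
        closest = min(replicas, key=lambda r: r.network_distance)            # (3) pick the "closest" replica
        start = offset % CHUNK_SIZE
        return closest.chunkserver.read(chunk_id, (start, start + length))   # (4) data never flows through the master

    # Tiny demo: one file made of a single chunk, replicated on two chunkservers.
    cs1, cs2 = Chunkserver(), Chunkserver()
    cs1.chunks["c1"] = cs2.chunks["c1"] = b"hello GFS"
    m = Master()
    m.files[("/logs/a", 0)] = ("c1", 1, [Replica(cs1, 2), Replica(cs2, 5)])
    print(gfs_read(m, "/logs/a", 0, 5))   # b'hello'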
GFS: mutations
- Mutations are write or append
– Mutations are performed at all the chunk's replicas in the same order
- Based on lease mechanism:
– Goal: minimize management overhead at the master
– Master grants chunk lease to primary replica – Primary picks a serial order for all the mutations to the chunk – All replicas follow this order when applying mutations – Primary replies to client, see (7) – Leases renewed using periodic heartbeat messages between master and chunkservers
19
- Data flow is decoupled from
control flow
- To fully utilize network
bandwidth, data are pushed linearly along a chain of chunkservers
Valeria Cardellini - SDCC 2017/18
(3): Client sends data to closest replica first
GFS: atomic append
- The client specifies only the data (with no offset)
- GFS appends data to the file at least once atomically
(i.e., as one continuous sequence of bytes)
– At offset chosen by GFS – Works with multiple concurrent writers – At least once: applications must cope with possible duplicates
- Operation heavily used by Google's distributed
applications
– E.g., files often serve as multiple-producers/single-consumer queue or contain merged results from many clients (MapReduce scenario)
20 Valeria Cardellini - SDCC 2017/18
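Because record append is at-least-once, readers must tolerate duplicates. A common application-level remedy, sketched here under the assumption that each producer tags records with a unique ID, is reader-side deduplication (illustrative, not part of the GFS API):

    # At-least-once append means the same record may appear more than once in the file.
    # A typical application-side remedy: give each record a unique ID and have
    # readers drop duplicates.
    import uuid

    def make_record(payload: bytes) -> tuple:
        return (uuid.uuid4().hex, payload)     # writer attaches a unique record ID

    def deduplicate(records):
        seen = set()
        for record_id, payload in records:
            if record_id in seen:
                continue                        # duplicate produced by a retried append
            seen.add(record_id)
            yield payload

    appended = [("id1", b"a"), ("id2", b"b"), ("id1", b"a"), ("id3", b"c")]
    print(list(deduplicate(appended)))          # [b'a', b'b', b'c']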
GFS: consistency model
- Changes to namespace (e.g., file creation) are atomic
– Managed exclusively by the master with locking guarantees
- Changes to data are ordered as chosen by a primary,
but failures can cause inconsistency
- GFS has a “relaxed” model: eventual consistency
– Simple and efficient to implement
- A file region is:
– Consistent: if all replicas have the same value – Defined: after a mutation if it is consistent and clients will see what the mutation writes in its entirety
- Properties:
– Concurrent successful mutations leave the region consistent but undefined: it may not reflect what any one mutation has written – A failed mutation makes the region inconsistent: chunk version number and re-replication used to restore data
21 Valeria Cardellini - SDCC 2017/18
GFS performance
Valeria Cardellini - SDCC 2017/18 22
- Read performance is satisfactory (80–100 MB/s)
- But write performance is lower (30 MB/s), and appending data to existing files is relatively slow (5 MB/s)
GFS limitations
23
What's the limitation of this architecture? The single master!
- Single point of failure
- Lose the master, and
you’ve lost the filesystem!
- Scalability bottleneck
Valeria Cardellini - SDCC 2017/18
GFS limitations: single master
- Solutions adopted to overcome issues related to the
presence of a single master
– Overcome single point of failure: multiple “shadow” masters that provide read-only access when the primary master is down – Overcome scalability bottleneck: by reducing the interaction between the master and the client
- The master stores only metadata (not data)
- The client can cache metadata
- Large chunk size
- Chunk lease: delegates the authority of coordinating the
mutations to the primary replica
24 Valeria Cardellini - SDCC 2017/18
GFS summary
- GFS success
– Used actively by Google to support search service and other applications – Availability and recoverability on cheap hardware – High throughput by decoupling control and data – Supports massive data sets and concurrent appends
- GFS problems (besides single master)
– All metadata stored in master memory
- Problems when storage grew to more than tens of PB
– Semantics not transparent to apps – Performance not good for all apps
- Designed for high throughput but not appropriate for latency-
sensitive applications like Gmail
– GFS was designed (in 2001) for batch applications with large files
25 Valeria Cardellini - SDCC 2017/18
Successor of GFS: Colossus
26 Valeria Cardellini - SDCC 2017/18
- Proprietary cluster file system at Google released in
2010
- Specifically designed for real-time services
- Automatically sharded metadata layer
- Error-correcting codes as part of the fault tolerance
mechanisms
- Data typically written using Reed-Solomon (1.5x)
- Client-driven encoding and replication
- Distributed masters
- Supports smaller files: chunks go from 64 MB to 1 MB
- Google Cloud Storage: Cloud object store built on
Colossus
HDFS
- Hadoop Distributed File System (HDFS)
– Open-source user-level distributed file system – Written in Java – Part of the Hadoop framework for Big data batch processing – Quite similar to GFS
- Master/worker architecture
- Data is replicated across the cluster
- Designed to span large clusters of commodity servers
- Servers can fail without aborting the computation process
27 Valeria Cardellini - SDCC 2017/18
Shafer et al., “The Hadoop Distributed Filesystem: Balancing Portability and Performance”, ISPASS 2010.
HDFS: file management
28 Valeria Cardellini - SDCC 2017/18
HDFS: architecture
- Two types of nodes in a HDFS cluster:
– One NameNode (master in GFS) – Multiple DataNodes (chunkservers in GFS)
29 Valeria Cardellini - SDCC 2017/18
HDFS: architecture
- The DataNodes just store and retrieve the file blocks
(also shards or chunks) when they are told to (by clients or the NameNode)
- The NameNode:
– Manages the file system tree and the metadata for all the files and directories – Knows the DataNodes on which all the blocks for a given file are located – Without the NameNode HDFS cannot be used
- It is important to make the NameNode resilient to failure
30 Valeria Cardellini - SDCC 2017/18
HDFS: file read
31
Source: “Hadoop: The definitive guide”
- NameNode is only used to get block location
Valeria Cardellini - SDCC 2017/18
HDFS: file write
32
Source: “Hadoop: The definitive guide”
- Clients ask NameNode for a list of suitable DataNodes
- This list forms a pipeline: the first DataNode stores a copy of the block, then forwards it to the second, and so on
Valeria Cardellini - SDCC 2017/18
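A conceptual sketch of the write pipeline described above: the client sends a block to the first DataNode, which stores its copy and forwards the block to the next node in the chain. The classes are simplified stand-ins, not the actual Hadoop interfaces:

    # Simplified model of the HDFS write pipeline: each DataNode stores its copy
    # of the block and forwards it to the next node in the list chosen by the NameNode.

    class DataNode:
        def __init__(self, name):
            self.name = name
            self.blocks = {}                               # block_id -> bytes

        def write_block(self, block_id, data, downstream):
            self.blocks[block_id] = data                   # store the local copy
            if downstream:                                 # forward along the pipeline
                downstream[0].write_block(block_id, data, downstream[1:])

    def client_write(pipeline, block_id, data):
        # The client only talks to the first DataNode of the pipeline.
        first, rest = pipeline[0], pipeline[1:]
        first.write_block(block_id, data, rest)

    pipeline = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]   # from the NameNode
    client_write(pipeline, "blk_0001", b"some block bytes")
    print([dn.name for dn in pipeline if "blk_0001" in dn.blocks])   # ['dn1', 'dn2', 'dn3']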
Relational DBMS challenges
- Web-based applications brought new demands
– Internet-scale data size – High read-write rates – Frequent schema changes
- Let’s scale RDBMSs
– RDBMSs were not designed to be distributed
- Possible solutions:
– Replication – Sharding
33 Valeria Cardellini - SDCC 2017/18
Database replication
- Master/worker architecture
– Primary-backup protocol
- Scales read operations
- Write operations?
34 Valeria Cardellini - SDCC 2017/18
Database sharding
- Horizontal partitioning of data across many separate
servers
- Allows scaling both read and write operations
- Joins and transactions across shards (partitions) are
slow and difficult to perform
35 Valeria Cardellini - SDCC 2017/18
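A minimal sketch of hash-based sharding: a routing function maps each key to one of several database servers. The shard names and the modulo-based placement are illustrative choices:

    # Hash-based horizontal partitioning (sharding): each user_id is routed to one shard.
    import hashlib

    SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

    def shard_for(user_id: str) -> str:
        digest = hashlib.md5(user_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("alice"), shard_for("bob"))
    # Cross-shard joins and transactions would have to touch several of these
    # servers, which is exactly why they are slow and hard to perform.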
Scaling RDBMS is expensive and inefficient
36
Source: Couchbase technical report
Valeria Cardellini - SDCC 2017/18
NoSQL data stores
- NoSQL = Not Only SQL
– SQL-style querying is not the crucial objective
- Main features of NoSQL data stores
– Avoid unneeded complexity – Support flexible schema and simple data model – Scale horizontally – Provide scalability and high availability by storing and replicating data in distributed systems, often across datacenters – Useful when working with Big data and the data’s nature does not require a relational model – Do not typically support ACID properties, but rather BASE
37 Valeria Cardellini - SDCC 2017/18
ACID vs BASE
38
- Two design philosophies at opposite ends of the
consistency-availability spectrum
- Keep in mind the CAP theorem!
- ACID: the traditional approach to address the
consistency issue in RDBMS
– Pessimistic approach – Does not scale well – Traditional RDBMS are CA systems
- BASE: usually adopted in NoSQL data stores
- Optimistic approach
- Scales well
- Most NoSQL data stores are AP systems
Valeria Cardellini - SDCC 2017/18
Pessimistic vs. optimistic approach
39
- Concurrency involves a fundamental tradeoff between:
- Safety (avoiding errors such as update conflicts)
- Liveness (responding quickly to clients)
- Pessimistic approaches often:
- Severely degrade the responsiveness of a system
- Lead to deadlocks, which are hard to prevent and debug
Valeria Cardellini - SDCC 2017/18
NoSQL cost and performance
40
Source: Couchbase technical report
Valeria Cardellini - SDCC 2017/18
Pros and cons of NoSQL
Pros
- Easy to scale out
- Higher performance for massive data scale
- Allows sharing of data across multiple servers
- Most solutions are either open source or cheaper
- HA and fault tolerance provided by data replication
- Supports complex data structures and objects
- No fixed schema, supports unstructured data
- Very fast retrieval of data, suitable for real-time apps

Cons
- Do not provide ACID guarantees, less suitable for OLTP apps
- No fixed schema, no common data storage model
- Limited support for aggregation (sum, avg, count, group by)
- Poor performance for complex joins
- No well-defined approach to DB design (different solutions have different data models)
- Lack of a consistent model can lead to solution lock-in
41
Valeria Cardellini - SDCC 2017/18
Barriers to NoSQL
- Main barriers to NoSQL adoption
– No full ACID transaction support – Lack of standardized interfaces – Huge investments already made in existing RDBMSs
- A commercial example
– AWS launched two NoSQL services (SimpleDB in 2007 and later DynamoDB in 2012) and one RDBMS service (RDS in 2009)
42 Valeria Cardellini - SDCC 2017/18
NoSQL data models
- A number of largely diverse data stores not based on
the relational data model
43 Valeria Cardellini - SDCC 2017/18
NoSQL data models
- A data model is a set of constructs for representing
the information
– Relational model: tables, columns and rows
- Storage model: how the system stores and
manipulates the data internally
- A data model is usually independent of the storage
model
- Data models for NoSQL systems:
– Aggregate-oriented models: key-value, document, and column-family – Graph-based models
- Aggregate: data as units that have a complex
structure
– E.g.: complex record with simple fields, arrays, records nested inside
44 Valeria Cardellini - SDCC 2017/18
Transactions?
- RDBMSs do have ACID transactions!
- NoSQL aggregate-oriented data stores:
– Support atomic transactions, but only within a single aggregate – Don’t have ACID transactions that span multiple aggregates
- Update over multiple aggregates: possible inconsistent reads
– Part of the consideration for deciding how to aggregate data
- Graph databases tend to support ACID transactions
45 Valeria Cardellini - SDCC 2017/18
Key-value data model
- Simple data model in which data is represented as a
collection of key-value pairs
– Associative array (map or dictionary) as fundamental data model
- Strongly aggregate-oriented
– Lots of aggregates – Each aggregate has a key
- Data model:
– A set of <key,value> pairs – Value: an aggregate instance
- The aggregate is opaque to the database
– Just a big blob of mostly meaningless bits
- Access to an aggregate:
– Lookup based on its key
- Richer data models can be implemented on top
46 Valeria Cardellini - SDCC 2017/18
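A toy in-memory sketch of the key-value model above, where the value is an opaque blob and the only access path is a lookup by key (illustrative, not any specific product's API):

    # Toy key-value store: the value is an opaque blob (here, a serialized aggregate).
    # Real stores add replication, persistence, and partitioning.
    import json

    class KVStore:
        def __init__(self):
            self._data = {}                                    # key -> opaque bytes

        def put(self, key: str, aggregate: dict):
            self._data[key] = json.dumps(aggregate).encode()   # aggregate becomes a blob

        def get(self, key: str) -> dict:
            return json.loads(self._data[key])                 # only lookup by key is possible

    store = KVStore()
    store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
    print(store.get("session:42"))
    # There is no way to ask the store "which sessions contain 'book'?":
    # that would require a full scan or an external index (e.g., full-text search).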
Query features in key-value data stores
- Only query by the key!
– There is a key and there is the rest of the data (the value)
- It is not possible to query by an attribute of the value
- The key needs to be suitably chosen
– E.g., session ID for storing session data
- What if we don’t know the key?
– Some systems allow searching inside the value using full-text search (e.g., using Apache Solr)
47 Valeria Cardellini - SDCC 2017/18
Key-value data stores
- Adopt consistency models ranging from eventual to
sequential consistency
- Some maintain data in memory (RAM), while others
employ solid-state drives or rotating disks
- Amazon’s Dynamo is the most notable example
– By Amazon, but different from DynamoDB
- Other key-value stores include:
– Riak
- Open-source implementation of Dynamo
– Amazon DynamoDB
- Data model and name from Dynamo, but different implementation
– Amazon S3 – Memcached – Redis
- Memcached and Redis are in-memory data stores
48 Valeria Cardellini - SDCC 2017/18
Column-family data model
- Strongly aggregate-oriented
– Lots of aggregates – Each aggregate has a key
- Similar to a key/value store, but the value can have
multiple attributes (columns)
- Data model: a two-level map structure:
– A set of <row-key, aggregate> pairs – Each aggregate is a group of pairs <column-key, value> – Column: a set of data values of a particular type
- Structure of the aggregate visible
- Columns can be organized in families
– Data usually accessed together
49 Valeria Cardellini - SDCC 2017/18
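The two-level map described above can be sketched as plain nested dictionaries: row key, then column family, then column name to value (illustrative structure, not the API of Bigtable, Cassandra, or HBase):

    # Column-family data model as a two-level map:
    #   row key -> column family -> column name -> value

    table = {
        "user:001": {
            "profile":  {"name": "Smith", "city": "Rome"},      # columns accessed together
            "activity": {"last_login": "2017-10-01"},
        },
        "user:002": {
            "profile":  {"name": "Jones", "city": "Milan"},
            "activity": {"last_login": "2017-09-15"},
        },
    }

    # Read only the columns you need, instead of the whole row:
    print(table["user:001"]["profile"]["name"])        # Smith
    # Column values of the same family can be stored contiguously on disk,
    # which is what makes read-mostly access patterns cheap.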
Column-family data model
50 Valeria Cardellini - SDCC 2017/18
Column-family data model
51
- Store and process data by column instead of by row
– Can access the needed data faster, rather than scanning and discarding unwanted data in a row – But the primary key is the data itself (see the example below)
…;Smith:001;Jones:002,004;Johnson:003;…
Valeria Cardellini - SDCC 2017/18
Column-family data model
52
- In many queries, few attributes are needed
– Column values are stored contiguously on disk: reduces I/O
- Both rows and columns are split over multiple nodes to
achieve scalability
- So column-family data stores are suitable for read-
mostly, read-intensive, large data repositories
Valeria Cardellini - SDCC 2017/18
Column-family data stores
- Google’s Bigtable is the most notable example
– Built on GFS and the Chubby lock service – Data storage organized in tables, whose rows are distributed over GFS
– Available as Cloud Bigtable on Google Cloud Platform
- Other column-family stores:
– Apache HBase
- Open-source implementation of Bigtable on top of Hadoop and
HDFS
– Cassandra – Amazon Redshift
53 Valeria Cardellini - SDCC 2017/18
Document data model
- Strongly aggregate-oriented
– Lots of aggregates – Each aggregate has a key
- Similar to a key-value store (unique key), but API or
query/update language to query or update based on the internal structure in the document
– The document content is no longer opaque
- Similar to a column-family store, but values can have
complex documents, instead of fixed format
- Document: encapsulates and encodes data in some
standard formats or encodings
– XML, JSON, BSON (binary JSON), …
54 Valeria Cardellini - SDCC 2017/18
Document data model
- Data model:
– A set of <key, document> pairs – Document: an aggregate instance
- Structure of the aggregate is
visible
– Limits on what we can place in it
- Access to an aggregate
– Queries based on the fields in the aggregate
- Flexible schema
– No strict schema to which documents must conform, which eliminates the need of schema migration efforts
55 Valeria Cardellini - SDCC 2017/18
Document data stores
- MongoDB and CouchDB are the two major
representatives
– Documents grouped together to form collections – Collections organized into databases
- Other document stores:
– Couchbase – Azure Cosmos DB as a Cloud service on the Azure Cloud platform
Valeria Cardellini - SDCC 2017/18 56
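A short example of querying on the internal structure of documents with MongoDB's Python driver (pymongo); it assumes a MongoDB instance reachable on localhost, and the database and collection names are illustrative:

    # Document store example with pymongo: the document content is not opaque,
    # so we can query and update by fields inside it.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    books = client["library"]["books"]                 # collections grouped into databases

    books.insert_one({
        "title": "Distributed Systems",
        "authors": ["Tanenbaum", "van Steen"],
        "year": 2017,
        "tags": ["textbook", "distributed"],           # flexible schema: fields may differ per document
    })

    # Query by a field inside the document (impossible in a pure key-value store):
    for doc in books.find({"authors": "Tanenbaum", "year": {"$gte": 2010}}):
        print(doc["title"])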
Graph data model
- Uses graph structures with nodes, edges, and
properties to represent stored data
– Nodes are the entities and have a set of attributes – Edges are the relationships between the entities
- E.g.: an author writes a book
– Nodes and edges also have individual properties
- Powerful data model
– Unlike other types of NoSQL stores, it concerns itself with relationships – Focus on visual representation of information (more human-friendly than other NoSQL stores) – Other types of NoSQL stores are poor for interconnected data
57 Valeria Cardellini - SDCC 2017/18
Graph data model: example
- A network of programmers
58 Valeria Cardellini - SDCC 2017/18
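The programmers-network idea can be sketched as nodes with attributes and labeled edges with their own properties; a plain in-memory representation for illustration (graph databases such as Neo4j expose this through a query language instead):

    # Graph data model: nodes (entities with attributes) and edges (relationships
    # with their own properties).

    nodes = {
        "alice": {"label": "Programmer", "language": "Go"},
        "bob":   {"label": "Programmer", "language": "Java"},
        "p1":    {"label": "Project", "name": "storage-engine"},
    }
    edges = [
        ("alice", "KNOWS",          "bob", {"since": 2015}),
        ("alice", "CONTRIBUTES_TO", "p1",  {"commits": 120}),
        ("bob",   "CONTRIBUTES_TO", "p1",  {"commits": 45}),
    ]

    # Traverse relationships: who works on the same projects as alice?
    alice_projects = {dst for src, rel, dst, _ in edges
                      if src == "alice" and rel == "CONTRIBUTES_TO"}
    coworkers = {src for src, rel, dst, _ in edges
                 if rel == "CONTRIBUTES_TO" and dst in alice_projects and src != "alice"}
    print(coworkers)   # {'bob'}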
Graph databases
- Explicit graph structure
- Major representatives:
– Neo4j – OrientDB
- Cons:
– Sharding: data partitioning is difficult – Horizontal scalability
- When related nodes are stored on different servers, traversing
multiple servers is not performance-efficient
– Require rewiring your brain
59 Valeria Cardellini - SDCC 2017/18
Takeaways
- Don’t get confused by many data models
- No single solution is the best in absolute terms
– The choice depends on app and workload characteristics – You can even use multiple data stores for different tasks of the same app
- Polyglot data persistence: use different data storage solution
for varying needs
Valeria Cardellini - SDCC 2017/18 60
Case study: Amazon’s Dynamo
- Highly available and scalable distributed key-value
data store built for Amazon’s platform
– A very diverse set of Amazon applications with different storage requirements – Need for storage technologies that are always available on a commodity hardware infrastructure
- E.g., shopping cart service: “Customers should be able to view
and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados”
– Meet stringent Service Level Agreements (SLAs)
- E.g., “service guaranteeing that it will provide a response within
300ms for 99.9% of its requests for a peak client load of 500 requests per second.”
61
- G. DeCandia et al., "Dynamo: Amazon's highly available key-value store",
- Proc. of ACM SOSP 2007.
Valeria Cardellini - SDCC 2017/18
Dynamo features
- Simple key-value API
– Simple operations to read (get) and write (put) objects uniquely identified by a key – Each operation involves only one object at a time
- Focus on eventually consistent store
– Sacrifices consistency for availability – BASE rather than ACID
- Efficient usage of resources
- Simple scale-out scheme to manage increasing data
set or request rates
- Internal use of Dynamo
– Security is not an issue since operation environment is assumed to be non-hostile
62 Valeria Cardellini - SDCC 2017/18
Dynamo design principles
- Sacrifice consistency for availability (CAP theorem)
- Use optimistic replication techniques
- Possible conflicting changes which must be detected
and resolved: when to resolve them and who resolves them?
– When: execute conflict resolution during reads rather than writes, i.e. “always writeable” data store – Who: data store or application; if data store, use simple policy (e.g., “last write wins”)
- Other key principles:
– Incremental scalability
- Scale-out with minimal impact on the system
– Symmetry and decentralization
- P2P techniques
– Heterogeneity
63 Valeria Cardellini - SDCC 2017/18
Dynamo API
- Each stored object has an associated key
- Simple API including get() and put() operations to
read and write objects
get(key)
- Returns single object or list of objects with conflicting versions
and context
- Conflicts are handled on reads, never reject a write
put(key, context, object)
- Determines where the replicas of the object should be placed
based on the associated key, and writes the replicas to disk
- Context encodes system metadata, e.g., version number
– Both key and object treated as opaque array of bytes – Key: 128-bit MD5 hash applied to client supplied key
64 Valeria Cardellini - SDCC 2017/18
Techniques used in Dynamo
65
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
Valeria Cardellini - SDCC 2017/18
Data partitioning in Dynamo
- Consistent hashing: output range of a hash is treated
as a ring (similar to Chord)
– MD5(key) -> node (position on the ring) – Unlike Chord: zero-hop DHT
- “Virtual nodes”
– Each node can be responsible for more than one virtual node – Work distribution proportional to the capabilities of the individual node
66 Valeria Cardellini - SDCC 2017/18
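A minimal consistent-hashing ring with virtual nodes, in the spirit of Dynamo's partitioning scheme; the hash function and the number of virtual nodes per node are arbitrary illustrative choices:

    # Consistent hashing ring with virtual nodes: each physical node owns several
    # positions on the ring, and a key is assigned to the first node clockwise
    # from its hash.
    import bisect
    import hashlib

    def ring_hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=8):
            self._ring = sorted((ring_hash(f"{n}#{i}"), n)
                                for n in nodes for i in range(vnodes))
            self._keys = [h for h, _ in self._ring]

        def node_for(self, key: str) -> str:
            idx = bisect.bisect_right(self._keys, ring_hash(key)) % len(self._ring)
            return self._ring[idx][1]              # first virtual node clockwise

    ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
    for k in ("cart:alice", "cart:bob", "cart:carol"):
        print(k, "->", ring.node_for(k))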
Replication in Dynamo
- Each object is replicated on N nodes
– N is a parameter configured per-instance by the application
- Preference list: list of nodes that is responsible for
storing a particular key
– More than N nodes to account for node failures – See figure: object identified by key K is replicated on nodes B, C and D
67
- Node D will store the keys in the
ranges (A, B], (B, C], and (C, D]
Valeria Cardellini - SDCC 2017/18
Data versioning in Dynamo
- A put() call may return to its caller before the
update has been applied at all the replicas
- A get() call operation may return an object that
does not have the latest updates
- Version branching can also happen due to node/
network failures
- Problem: multiple versions of an object, that the
system needs to reconcile
- Solution: use vector clocks to capture the causality among different versions of the same object
– If causal: older version can be forgotten – If concurrent: conflict exists, requiring reconciliation
69 Valeria Cardellini - SDCC 2017/18
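A small vector-clock sketch showing how Dynamo-style versioning distinguishes causally ordered versions (the older one can be forgotten) from concurrent ones (which require reconciliation); node names are illustrative:

    # Vector clocks: one counter per node that handled a write. If one clock
    # dominates the other, the versions are causally ordered; otherwise they are
    # concurrent and need reconciliation.

    def dominates(vc_a: dict, vc_b: dict) -> bool:
        """True if vc_a has seen every event in vc_b (vc_a >= vc_b component-wise)."""
        return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

    def compare(vc_a, vc_b):
        if dominates(vc_a, vc_b):
            return "a supersedes b (b can be forgotten)"
        if dominates(vc_b, vc_a):
            return "b supersedes a (a can be forgotten)"
        return "concurrent: conflict, requires reconciliation"

    v1 = {"Sx": 2, "Sy": 1}           # written via nodes Sx and Sy
    v2 = {"Sx": 2, "Sy": 1, "Sz": 1}  # a later write through Sz: causally after v1
    v3 = {"Sx": 3, "Sy": 1}           # an independent write through Sx: concurrent with v2
    print(compare(v2, v1))
    print(compare(v2, v3))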
Sloppy quorum in Dynamo
- R/W: minimum number of nodes that must participate
in a successful read/write operation
- Setting R + W > N yields a quorum-like system
– The latency of a get() or put() operation is dictated by the slowest of the R or W replicas – R and W are usually configured to be less than N, to provide better latency – Typical configuration in Dynamo: (N, R, W) = (3, 2, 2)
- Balances performance, durability, and availability
- Sloppy quorum
– Due to partitions, strict quorums might not exist – Sloppy quorum: create transient replicas on the first N healthy nodes from the preference list (which may not always be the first N nodes encountered while walking the consistent hashing ring)
71 Valeria Cardellini - SDCC 2017/18
Put and get operations
- put
– Coordinator generates a new vector clock and writes the new version locally – Sends it to the N nodes – Waits for responses from W nodes
- get
– Coordinator requests existing versions from N
- Wait for response from R nodes
– If multiple versions, return all versions that are causally unrelated – Divergent versions are then reconciled – Reconciled version written back
72 Valeria Cardellini - SDCC 2017/18
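A highly simplified coordinator for the put path with (N, R, W) = (3, 2, 2): the write is sent to the first N replicas of the preference list and succeeds once W of them acknowledge. The replica objects are illustrative stand-ins, not Dynamo's real node interface:

    # Sketch of a quorum write with (N, R, W) = (3, 2, 2): the coordinator sends
    # the new version to the first N replicas and reports success once W
    # acknowledgements arrive. R is used symmetrically on the read path.

    N, R, W = 3, 2, 2

    class Replica:
        def __init__(self, name, alive=True):
            self.name, self.alive, self.store = name, alive, {}

        def put(self, key, value, version):
            if not self.alive:
                return False                      # failed node: no acknowledgement
            self.store[key] = (value, version)
            return True

    def coordinator_put(preference_list, key, value, version):
        acks = sum(r.put(key, value, version) for r in preference_list[:N])
        return acks >= W                          # success only with a write quorum

    replicas = [Replica("A"), Replica("B", alive=False), Replica("C")]
    ok = coordinator_put(replicas, "cart:alice", ["book"], version={"A": 1})
    print(ok)   # True: 2 acks out of 3 satisfy W = 2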
Hinted handoff in Dynamo
- Consider N = 3; if A is
temporarily down or unreachable, put will use D
- D knows that the replica
belongs to A
- Later, D detects A is alive
– Sends the replica to A – Removes the replica
73
- Hinted handoff for transient failures
- Again, “always writeable” principle
Valeria Cardellini - SDCC 2017/18
Membership management
- Administrator explicitly adds and removes nodes
- Gossiping to propagate membership changes
– Eventually consistent view – O(1) hop overlay
75 Valeria Cardellini - SDCC 2017/18
Case study: Memcached
- Free and open-source key-value data store, used as a caching layer by Flickr, Twitter, Wikipedia, YouTube
- High-performance, distributed
memory object caching system
– Generic in nature, but intended for use in speeding up dynamic web applications by alleviating load on data layer
- Provides in-memory key-value store
for small chunks of arbitrary data (strings, objects) from results of database queries, API calls, or page rendering
- API available for most languages
- Available on AWS as ElastiCache
76 Valeria Cardellini - SDCC 2017/18
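Typical cache-aside usage with the python-memcached client; it assumes a memcached daemon on localhost:11211, and query_database is a placeholder for the real data layer:

    # Cache-aside pattern with memcached: try the cache first, fall back to the
    # database and populate the cache on a miss.
    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    def query_database(user_id):
        return {"id": user_id, "name": "Alice"}   # placeholder for a real (slow) DB query

    def get_user(user_id):
        key = f"user:{user_id}"
        user = mc.get(key)
        if user is None:                          # cache miss: go to the data layer
            user = query_database(user_id)
            mc.set(key, user, time=300)           # cache for 5 minutes
        return user

    print(get_user(42))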
Case study: AWS DynamoDB
- DynamoDB: NoSQL document and key-value data store
fully managed by AWS
– Global service (no choice of AWS region)
- Goal: fast and predictable performance with seamless
scalability
77
- AWS services for the data tier
Valeria Cardellini - SDCC 2017/18
Case study: DynamoDB
- Consistency model
– Eventually consistent reads (default)
- Maximizes read throughput
– Strongly consistent reads
- Durability
– Writes continuously replicated to 3 AWS availability zones – Quorum acknowledgment – Persisted on disk
- Automatic partitioning
– Automatically spreads the table data and traffic over multiple servers to handle the request capacity specified by the customer and the amount of data stored – Partitions and re-partitions the data as the table size grows
- How to use it
– AWS Management Console or Amazon DynamoDB APIs
78 Valeria Cardellini - SDCC 2017/18
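A minimal example of using the DynamoDB API through boto3; the region, table name, and key schema are illustrative, and the 'short-urls' table (with partition key 'short_id') is assumed to already exist:

    # Minimal DynamoDB usage via boto3 (one of the "Amazon DynamoDB APIs" above).
    # Assumes AWS credentials are configured.
    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
    table = dynamodb.Table("short-urls")

    # Write an item (schema-less attributes besides the primary key)
    table.put_item(Item={"short_id": "abc123",
                         "long_url": "https://example.org/some/long/path",
                         "hits": 0})

    # Reads are eventually consistent by default; set ConsistentRead=True
    # for a strongly consistent read.
    resp = table.get_item(Key={"short_id": "abc123"}, ConsistentRead=False)
    print(resp.get("Item"))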
DynamoDB: example app
- App: scalable URL shortener
- Architecture
Valeria Cardellini - SDCC 2017/18 79
DynamoDB: example app
- Architecture
Valeria Cardellini - SDCC 2017/18 80
DynamoDB: example app
Valeria Cardellini - SDCC 2017/18 81
Primary key Attributes (schema-less)
- For full example, see http://bit.ly/2C4pAv0