Università degli Studi di Roma Tor Vergata, Dipartimento di Ingegneria Civile e Ingegneria Informatica. Distributed and Cloud Storage Systems. Corso di Sistemi Distribuiti e Cloud Computing, A.A. 2017/18. Valeria Cardellini
Why scale the storage?
Storage capacities and data transfer rates have increased massively over the years. Let's consider the time needed to transfer data*
1
HDD: size ~1 TB, speed 250 MB/s — SSD: size ~1 TB, speed 850 MB/s

Data size | HDD          | SSD
10 GB     | 40 s         | 12 s
100 GB    | 6 m 49 s     | 2 m
1 TB      | 1 h 9 m 54 s | 20 m 33 s
10 TB     | ?            | ?
* we consider no overhead
We need to scale out!
Valeria Cardellini - SDCC 2017/18
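As a sanity check on the table above, these times are just size divided by sustained throughput. A minimal sketch in Python (assuming binary units and no transfer overhead; small rounding differences from the table are expected):

    # Back-of-envelope transfer times: time = data size / sustained throughput.
    # Assumes binary units (1 GB = 1024 MB) and no protocol or seek overhead.

    def transfer_time(size_gb, throughput_mb_s):
        seconds = size_gb * 1024 / throughput_mb_s
        h, rem = divmod(int(round(seconds)), 3600)
        m, s = divmod(rem, 60)
        return f"{h}h {m}m {s}s" if h else (f"{m}m {s}s" if m else f"{s}s")

    for size_gb in (10, 100, 1024, 10 * 1024):    # 10 GB, 100 GB, 1 TB, 10 TB
        hdd = transfer_time(size_gb, 250)         # HDD: ~250 MB/s
        ssd = transfer_time(size_gb, 850)         # SSD: ~850 MB/s
        print(f"{size_gb:>6} GB  HDD: {hdd:>12}  SSD: {ssd:>10}")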
General principles for scalable data storage
- Scalability and high performance
– To cope with the continuous growth of data to store – Use multiple storage nodes
- Ability to run on commodity hardware
– Hardware failures are the norm rather than the exception
- Reliability and fault tolerance
– Transparent data replication
- Availability
– Data should be available when needed – CAP theorem: trade-off with consistency
2 Valeria Cardellini - SDCC 2017/18
Solutions for scalable data storage
Various forms of scalable data storage:
- Distributed file systems
– Manage (large) files on multiple nodes – Examples: Google File System, Hadoop Distributed File System
- NoSQL databases (more generally, NoSQL data stores)
– Simple and flexible non-relational data models – Horizontal scalability and fault tolerance – Key-value, column family, document, and graph stores – Examples: BigTable, Cassandra, MongoDB, HBase, DynamoDB – Several time series databases are built on top of NoSQL databases (examples: InfluxDB, KairosDB)
- NewSQL databases
– Add horizontal scalability and fault tolerance to the relational model – Examples: VoltDB, Google Spanner
3 Valeria Cardellini - SDCC 2017/18
Scalable data storage solutions
4
The whole picture of the different solutions
Valeria Cardellini - SDCC 2017/18
Data storage in the Cloud
- Main goals:
– Massive scaling “on demand” (elasticity) – Data availability – Simplified application development and deployment
- Some storage systems offered only as Cloud services
– Either directly (e.g., Amazon DynamoDB, Google Bigtable, Google Cloud Storage) or as part of a programming environment
- Other proprietary systems used only internally (e.g.,
Dynamo, GFS)
5 Valeria Cardellini - SDCC 2017/18
Distributed file systems
- Represent the primary support for data management
- Manage data storage across a network of machines
- Provide an interface through which to store information in the form of files and later access it for read and write operations
– Using the traditional file system interface
- Several solutions with different design choices
– GFS, Apache HDFS (GFS open-source clone): designed for batch applications with large files – Alluxio: in-memory (high-throughput) storage system – Lustre, Ceph: designed for high performance
6 Valeria Cardellini - SDCC 2017/18
Where to store data?
- Memory I/O vs. disk I/O
- See “Latency numbers every programmer should know”
http://bit.ly/2pZXIU9
7 Valeria Cardellini - SDCC 2017/18
Case study: Google File System
- Distributed fault-tolerant file system implemented in
user space
- Manages (very) large files: usually multi-GB
- Divide et impera: file divided into fixed-size chunks
- Chunks:
– Have a fixed size – Transparent to users – Each chunk is stored as a plain file
- Files follow the write-once, read-many-times pattern
– Efficient append operation: appends data at the end of a file atomically at least once, even in the presence of concurrent operations (minimal synchronization overhead)
- Fault tolerance, high availability through chunk
replication
8
- S. Ghemawat, H. Gobioff, S.-T. Leung, "The Google File System”, ACM SOSP ‘03.
Valeria Cardellini - SDCC 2017/18
GFS operation environment
Valeria Cardellini - SDCC 2017/18 9
GFS: architecture
- Master
– Single, centralized entity (this simplifies the design) – Manages file metadata (stored in memory)
- Metadata: access control information, mapping from files to
chunks, chunk locations
– Does not store data (i.e., chunks) – Manages chunks: creation, replication, load balancing, deletion
10 Valeria Cardellini - SDCC 2017/18
GFS: architecture
- Chunkservers (100s – 1000s)
– Store chunks as files – Spread across cluster racks
- Clients
– Issue control (metadata) requests to GFS master – Issue data requests directly to GFS chunkservers – Cache metadata but not data (simplifies the design)
11 Valeria Cardellini - SDCC 2017/18
GFS: metadata
- The master stores three major types of metadata:
– File and chunk namespace (directory hierarchy) – Mapping from files to chunks – Current locations of chunks
- Metadata are stored in memory (64B per chunk)
– Pro: fast; easy and efficient to scan the entire state – Con: the number of chunks is limited by the amount of memory of the master:
"The cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility gained"
- The master also keeps an operation log with a
historical record of metadata changes
– Persistent on local disk – Replicated – Checkpoint for fast recovery
12 Valeria Cardellini - SDCC 2017/18
GFS: chunk size
- Chunk size is either 64 MB or 128 MB
– Much larger than typical block sizes
- Why? Large chunk size reduces:
– Number of interactions between client and master – Size of metadata stored on master – Network overhead (persistent TCP connection to the chunk server over an extended period of time)
- Potential disadvantage
– Chunks for small files may become hot spots
13 Valeria Cardellini - SDCC 2017/18
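A back-of-envelope estimate (not from the slides) of why large chunks keep the master's in-memory metadata manageable, using the 64 MB chunk size and the ~64 B of metadata per chunk mentioned above:

    # Rough estimate of GFS master metadata for a given amount of stored data.
    # Assumptions: 64 MB chunks and ~64 B of in-memory metadata per chunk
    # (both figures from the GFS design); replication is ignored here.

    CHUNK_SIZE = 64 * 2**20          # 64 MB
    METADATA_PER_CHUNK = 64          # ~64 bytes kept in master memory per chunk

    def master_metadata(stored_bytes, chunk_size=CHUNK_SIZE):
        chunks = -(-stored_bytes // chunk_size)        # ceiling division
        return chunks, chunks * METADATA_PER_CHUNK

    for data in (2**40, 2**50):                        # 1 TB and 1 PB of file data
        chunks, meta = master_metadata(data)
        print(f"{data / 2**40:>6.0f} TB -> {chunks:>10} chunks, "
              f"{meta / 2**20:.1f} MB of metadata")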
GFS: fault-tolerance and replication
- The master replicates (and maintains the replication) of
each chunk on several chunkservers
– At least 3 replicas on different chunkservers – Replication based on a primary-backup scheme – Replication degree > 3 for highly requested chunks
- Multi-level placement of replicas
– Different machines, local rack: improves reliability and availability – Different machines, different racks: also improves aggregate bandwidth
- Data integrity
– Chunk divided into 64 KB blocks; 32-bit checksum for each block – Checksums kept in memory – Checksum is verified every time a client reads data
14 Valeria Cardellini - SDCC 2017/18
GFS: master operations
- Stores metadata
- Manages and locks namespace
– Namespace represented as a lookup table
- Periodic communication with each chunkserver
– Sends instructions and collects chunkserver state (heartbeat messages)
- Creates, re-replicates, rebalances chunks
– Balances disk space utilization and load across chunkservers – Distributes replicas among racks to increase fault tolerance – Re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal
15 Valeria Cardellini - SDCC 2017/18
GFS: master operations (2)
- Garbage collection
– File deletion logged by the master – File renamed to a hidden name with deletion timestamp: its deletion is postponed – Deleted files can be easily recovered in a limited timespan
- Stale replica detection
– Chunk replicas may become stale if a chunkserver fails or misses updates to the chunk – For each chunk, the master keeps a chunk version number – Chunk version number updated for each chunk mutation – The master removes stale replicas in its regular garbage collection
16 Valeria Cardellini - SDCC 2017/18
GFS: system interactions
- Files are hierarchically organized in directories
– There is no data structure that represents a directory
- A file is identified by its pathname
– GFS does not support aliases
- GFS supports traditional file system operations (but no POSIX API)
– create, delete, open, close, read, write
- Also supports two special operations:
– snapshot: makes a copy of a file or a directory tree almost instantaneously (based on copy-on-write techniques) – record append: atomically appends data to a file; multiple clients can append to the same file concurrently without fear of overwriting one another's data
17 Valeria Cardellini - SDCC 2017/18
GFS: system interactions
18
- Read operation
- Data flow is decoupled from control flow
(1) Client sends the master: read(file name, chunk index)
(2) Master replies: chunk ID, chunk version number, locations of replicas
(3) Client sends the "closest" chunkserver holding a replica: read(chunk ID, byte range)
(4) Chunkserver replies with the data (see the code sketch below)
Valeria Cardellini - SDCC 2017/18
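A minimal sketch of the read protocol above; the Master and Chunkserver classes are simplified in-memory stand-ins for the real RPC interfaces, not Google's actual API:

    # Conceptual sketch of the GFS read path: metadata from the master,
    # data directly from a chunkserver.
    from collections import namedtuple

    CHUNK_SIZE = 64 * 2**20  # 64 MB chunks
    Replica = namedtuple("Replica", "chunkserver network_distance")

    class Chunkserver:
        def __init__(self):
            self.chunks = {}                      # chunk_id -> bytes
        def read(self, chunk_id, byte_range):
            start, end = byte_range
            return self.chunks[chunk_id][start:end]

    class Master:
        def __init__(self):
            self.files = {}                       # (filename, chunk_index) -> (chunk_id, version, [Replica])
        def lookup(self, filename, chunk_index):  # metadata-only request
            return self.files[(filename, chunk_index)]

    def gfs_read(master, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE                                   # (1) client computes chunk index
        chunk_id, version, replicas = master.lookup(filename, chunk_index)   # (2) master replies with metadata
        closest = min(replicas, key=lambda r: r.network_distance)            # (3) pick the "closest" replica
        start = offset % CHUNK_SIZE
        return closest.chunkserver.read(chunk_id, (start, start + length))   # (4) data never flows through the master

    # Tiny demo: one file made of a single chunk, replicated on two chunkservers.
    cs1, cs2 = Chunkserver(), Chunkserver()
    cs1.chunks["c1"] = cs2.chunks["c1"] = b"hello GFS"
    m = Master()
    m.files[("/logs/a", 0)] = ("c1", 1, [Replica(cs1, 2), Replica(cs2, 5)])
    print(gfs_read(m, "/logs/a", 0, 5))   # b'hello'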
GFS: mutations
- Mutations are write or append
– Mutations are performed at all the chunk's replicas in the same order
- Based on lease mechanism:
– Goal: minimize management overhead at the master
– Master grants chunk lease to primary replica – Primary picks a serial order for all the mutations to the chunk – All replicas follow this order when applying mutations – Primary replies to client, see (7) – Leases renewed using periodic heartbeat messages between master and chunkservers
19
- Data flow is decoupled from
control flow
- To fully utilize network
bandwidth, data are pushed linearly along a chain of chunkservers
Valeria Cardellini - SDCC 2017/18
(3): Client sends data to closest replica first
GFS: atomic append
- The client specifies only the data (with no offset)
- GFS appends data to the file at least once atomically
(i.e., as one continuous sequence of bytes)
– At offset chosen by GFS – Works with multiple concurrent writers – At least once: applications must cope with possible duplicates
- Operation heavily used by Google's distributed
applications
– E.g., files often serve as multiple-producers/single-consumer queue or contain merged results from many clients (MapReduce scenario)
20 Valeria Cardellini - SDCC 2017/18
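Because record append is at-least-once, readers must tolerate duplicates. A common application-level remedy, sketched here under the assumption that each producer tags records with a unique ID, is reader-side deduplication (illustrative, not part of the GFS API):

    # At-least-once append means the same record may appear more than once in the file.
    # A typical application-side remedy: give each record a unique ID and have
    # readers drop duplicates.
    import uuid

    def make_record(payload: bytes) -> tuple:
        return (uuid.uuid4().hex, payload)     # writer attaches a unique record ID

    def deduplicate(records):
        seen = set()
        for record_id, payload in records:
            if record_id in seen:
                continue                        # duplicate produced by a retried append
            seen.add(record_id)
            yield payload

    appended = [("id1", b"a"), ("id2", b"b"), ("id1", b"a"), ("id3", b"c")]
    print(list(deduplicate(appended)))          # [b'a', b'b', b'c']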
GFS: consistency model
- Changes to namespace (e.g., file creation) are atomic
– Managed exclusively by the master with locking guarantees
- Changes to data are ordered as chosen by a primary,
but failures can cause inconsistency
- GFS has a “relaxed” model: eventual consistency
– Simple and efficient to implement
- A file region is:
– Consistent: if all replicas have the same value – Defined: after a mutation if it is consistent and clients will see what the mutation writes in its entirety
- Properties:
– Concurrent successful mutations leave the region consistent but undefined: it may not reflect what any one mutation has written – A failed mutation makes the region inconsistent: chunk version number and re-replication used to restore data
21 Valeria Cardellini - SDCC 2017/18
GFS performance
Valeria Cardellini - SDCC 2017/18 22
- Read performance is satisfactory (80–100 MB/s)
- But write performance is lower (30 MB/s), and appending data to existing files is relatively slow (5 MB/s)
GFS limitations
23
What's the limitation of this architecture? The single master!
- Single point of failure
- Lose the master, and
you’ve lost the filesystem!
- Scalability bottleneck
Valeria Cardellini - SDCC 2017/18
GFS limitations: single master
- Solutions adopted to overcome issues related to the
presence of a single master
– Overcome single point of failure: multiple “shadow” masters that provide read-only access when the primary master is down – Overcome scalability bottleneck: by reducing the interaction between the master and the client
- The master stores only metadata (not data)
- The client can cache metadata
- Large chunk size
- Chunk lease: delegates the authority of coordinating the
mutations to the primary replica
24 Valeria Cardellini - SDCC 2017/18
GFS summary
- GFS success
– Used actively by Google to support search service and other applications – Availability and recoverability on cheap hardware – High throughput by decoupling control and data – Supports massive data sets and concurrent appends
- GFS problems (besides single master)
– All metadata stored in master memory
- Problems when storage grew to more than tens of PB
– Semantics not transparent to apps – Performance not good for all apps
- Designed for high throughput but not appropriate for latency-
sensitive applications like Gmail
– GFS was designed (in 2001) for batch applications with large files
25 Valeria Cardellini - SDCC 2017/18
Successor of GFS: Colossus
26 Valeria Cardellini - SDCC 2017/18
- Proprietary cluster file system at Google released in
2010
- Specifically designed for real-time services
- Automatically sharded metadata layer
- Error-correcting codes as part of the fault tolerance
mechanisms
- Data typically written using Reed-Solomon (1.5x)
- Client-driven encoding and replication
- Distributed masters
- Supports smaller files: chunks go from 64 MB to 1 MB
- Google Cloud Storage: Cloud object store built on
Colossus
HDFS
- Hadoop Distributed File System (HDFS)
– Open-source user-level distributed file system – Written in Java – Part of the Hadoop framework for Big data batch processing – Quite similar to GFS
- Master/worker architecture
- Data is replicated across the cluster
- Designed to span large clusters of commodity servers
- Servers can fail without aborting the computation process
27 Valeria Cardellini - SDCC 2017/18
Shafer et al., “The Hadoop Distributed Filesystem: Balancing Portability and Performance”, ISPASS 2010.
HDFS: file management
28 Valeria Cardellini - SDCC 2017/18
HDFS: architecture
- Two types of nodes in a HDFS cluster:
– One NameNode (master in GFS) – Multiple DataNodes (chunkservers in GFS)
29 Valeria Cardellini - SDCC 2017/18
HDFS: architecture
- The DataNodes just store and retrieve the file blocks
(also shards or chunks) when they are told to (by clients or the NameNode)
- The NameNode:
– Manages the file system tree and the metadata for all the files and directories – Knows the DataNodes on which all the blocks for a given file are located – Without the NameNode HDFS cannot be used
- It is important to make the NameNode resilient to failure
30 Valeria Cardellini - SDCC 2017/18
HDFS: file read
31
Source: “Hadoop: The definitive guide”
- NameNode is only used to get block location
Valeria Cardellini - SDCC 2017/18
HDFS: file write
32
Source: “Hadoop: The definitive guide”
- Clients ask NameNode for a list of suitable DataNodes
- This list forms a pipeline: the first DataNode stores a copy of the block, then forwards it to the second, and so on
Valeria Cardellini - SDCC 2017/18
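A conceptual sketch of the write pipeline described above: the client sends a block to the first DataNode, which stores its copy and forwards the block to the next node in the chain. The classes are simplified stand-ins, not the actual Hadoop interfaces:

    # Simplified model of the HDFS write pipeline: each DataNode stores its copy
    # of the block and forwards it to the next node in the list chosen by the NameNode.

    class DataNode:
        def __init__(self, name):
            self.name = name
            self.blocks = {}                               # block_id -> bytes

        def write_block(self, block_id, data, downstream):
            self.blocks[block_id] = data                   # store the local copy
            if downstream:                                 # forward along the pipeline
                downstream[0].write_block(block_id, data, downstream[1:])

    def client_write(pipeline, block_id, data):
        # The client only talks to the first DataNode of the pipeline.
        first, rest = pipeline[0], pipeline[1:]
        first.write_block(block_id, data, rest)

    pipeline = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]   # from the NameNode
    client_write(pipeline, "blk_0001", b"some block bytes")
    print([dn.name for dn in pipeline if "blk_0001" in dn.blocks])   # ['dn1', 'dn2', 'dn3']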
Relational DBMS challenges
- Web-based applications brought new demands
– Internet-scale data size – High read-write rates – Frequent schema changes
- Let’s scale RDBMSs
– RDBMSs were not designed to be distributed
- Possible solutions:
– Replication – Sharding
33 Valeria Cardellini - SDCC 2017/18
Database replication
- Master/worker architecture
– Primary-backup protocol
- Scales read operations
- Write operations?
34 Valeria Cardellini - SDCC 2017/18
Database sharding
- Horizontal partitioning of data across many separate
servers
- Allows scaling both read and write operations
- Joins and transactions across shards (partitions) are
slow and difficult to perform
35 Valeria Cardellini - SDCC 2017/18
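A minimal sketch of hash-based sharding: a routing function maps each key to one of several database servers. The shard names and the modulo-based placement are illustrative choices:

    # Hash-based horizontal partitioning (sharding): each user_id is routed to one shard.
    import hashlib

    SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

    def shard_for(user_id: str) -> str:
        digest = hashlib.md5(user_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("alice"), shard_for("bob"))
    # Cross-shard joins and transactions would have to touch several of these
    # servers, which is exactly why they are slow and hard to perform.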
Scaling RDBMS is expensive and inefficient
36
Source: Couchbase technical report
Valeria Cardellini - SDCC 2017/18
NoSQL data stores
- NoSQL = Not Only SQL
– SQL-style querying is not the crucial objective
- Main features of NoSQL data stores
– Avoid unneeded complexity – Support flexible schema and simple data model – Scale horizontally – Provide scalability and high availability by storing and replicating data in distributed systems, often across datacenters – Useful when working with Big data and the data’s nature does not require a relational model – Do not typically support ACID properties, but rather BASE
37 Valeria Cardellini - SDCC 2017/18
ACID vs BASE
38
- Two design philosophies at opposite ends of the
consistency-availability spectrum
- Keep in mind the CAP theorem!
- ACID: the traditional approach to address the
consistency issue in RDBMS
– Pessimistic approach – Does not scale well – Traditional RDBMS are CA systems
- BASE: usually adopted in NoSQL data stores
- Optimistic approach
- Scales well
- Most NoSQL data stores are AP systems
Valeria Cardellini - SDCC 2017/18
Pessimistic vs. optimistic approach
39
- Concurrency involves a fundamental tradeoff between:
- Safety (avoiding errors such as update conflicts)
- Liveness (responding quickly to clients)
- Pessimistic approaches often:
- Severely degrade the responsiveness of a system
- Lead to deadlocks, which are hard to prevent and debug
Valeria Cardellini - SDCC 2017/18
NoSQL cost and performance
40
Source: Couchbase technical report
Valeria Cardellini - SDCC 2017/18
Pros and cons of NoSQL
Pros
- Easy to scale out
- Higher performance for massive data scale
- Allows sharing of data across multiple servers
- Most solutions are either open source or cheaper
- HA and fault tolerance provided by data replication
- Supports complex data structures and objects
- No fixed schema, supports unstructured data
- Very fast retrieval of data, suitable for real-time apps

Cons
- Do not provide ACID guarantees, less suitable for OLTP apps
- No fixed schema, no common data storage model
- Limited support for aggregation (sum, avg, count, group by)
- Poor performance for complex joins
- No well-defined approach to DB design (different solutions have different data models)
- Lack of a consistent model can lead to solution lock-in
41
Valeria Cardellini - SDCC 2017/18
Barriers to NoSQL
- Main barriers to NoSQL adoption
– No full ACID transaction support – Lack of standardized interfaces – Huge investments already made in existing RDBMSs
- A commercial example
– AWS launched two NoSQL services (SimpleDB in 2007 and later DynamoDB in 2012) and one RDBMS service (RDS in 2009)
42 Valeria Cardellini - SDCC 2017/18
NoSQL data models
- A number of largely diverse data stores not based on
the relational data model
43 Valeria Cardellini - SDCC 2017/18
NoSQL data models
- A data model is a set of constructs for representing
the information
– Relational model: tables, columns and rows
- Storage model: how the system stores and
manipulates the data internally
- A data model is usually independent of the storage
model
- Data models for NoSQL systems:
– Aggregate-oriented models: key-value, document, and column-family – Graph-based models
- Aggregate: data as units that have a complex
structure
– E.g.: complex record with simple fields, arrays, records nested inside
44 Valeria Cardellini - SDCC 2017/18
Transactions?
- RDBMSs do have ACID transactions!
- NoSQL aggregate-oriented data stores:
– Support atomic transactions, but only within a single aggregate – Don’t have ACID transactions that span multiple aggregates
- Update over multiple aggregates: possible inconsistent reads
– Part of the consideration for deciding how to aggregate data
- Graph databases tend to support ACID transactions
45 Valeria Cardellini - SDCC 2017/18
Key-value data model
- Simple data model in which data is represented as a
collection of key-value pairs
– Associative array (map or dictionary) as fundamental data model
- Strongly aggregate-oriented
– Lots of aggregates – Each aggregate has a key
- Data model:
– A set of <key,value> pairs – Value: an aggregate instance
- The aggregate is opaque to the database
– Just a big blob of mostly meaningless bits
- Access to an aggregate:
– Lookup based on its key
- Richer data models can be implemented on top
46 Valeria Cardellini - SDCC 2017/18
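A toy in-memory sketch of the key-value model above, where the value is an opaque blob and the only access path is a lookup by key (illustrative, not any specific product's API):

    # Toy key-value store: the value is an opaque blob (here, a serialized aggregate).
    # Real stores add replication, persistence, and partitioning.
    import json

    class KVStore:
        def __init__(self):
            self._data = {}                                    # key -> opaque bytes

        def put(self, key: str, aggregate: dict):
            self._data[key] = json.dumps(aggregate).encode()   # aggregate becomes a blob

        def get(self, key: str) -> dict:
            return json.loads(self._data[key])                 # only lookup by key is possible

    store = KVStore()
    store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
    print(store.get("session:42"))
    # There is no way to ask the store "which sessions contain 'book'?":
    # that would require a full scan or an external index (e.g., full-text search).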
Query features in key-value data stores
- Only query by the key!
– There is a key and there is the rest of the data (the value)
- It is not possible to query by an attribute of the value
- The key needs to be suitably chosen
– E.g., session ID for storing session data
- What if we don’t know the key?
– Some systems allow searching inside the value using full-text search (e.g., using Apache Solr)
47 Valeria Cardellini - SDCC 2017/18
Key-value data stores
- Adopt consistency models ranging from eventual to
sequential consistency
- Some maintain data in memory (RAM), while others
employ solid-state drives or rotating disks
- Amazon’s Dynamo is the most notable example
– By Amazon, but different from DynamoDB
- Other key-value stores include:
– Riak
- Open-source implementation of Dynamo
– Amazon DynamoDB
- Data model and name from Dynamo, but different implementation
– Amazon S3 – Memcached – Redis
- Memcached and Redis are in-memory data stores
48 Valeria Cardellini - SDCC 2017/18
Column-family data model
- Strongly aggregate-oriented
– Lots of aggregates – Each aggregate has a key
- Similar to a key/value store, but the value can have
multiple attributes (columns)
- Data model: a two-level map structure:
– A set of <row-key, aggregate> pairs – Each aggregate is a group of pairs <column-key, value> – Column: a set of data values of a particular type
- Structure of the aggregate visible
- Columns can be organized in families
– Data usually accessed together
49 Valeria Cardellini - SDCC 2017/18
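The two-level map described above can be sketched as plain nested dictionaries: row key, then column family, then column name to value (illustrative structure, not the API of Bigtable, Cassandra, or HBase):

    # Column-family data model as a two-level map:
    #   row key -> column family -> column name -> value

    table = {
        "user:001": {
            "profile":  {"name": "Smith", "city": "Rome"},      # columns accessed together
            "activity": {"last_login": "2017-10-01"},
        },
        "user:002": {
            "profile":  {"name": "Jones", "city": "Milan"},
            "activity": {"last_login": "2017-09-15"},
        },
    }

    # Read only the columns you need, instead of the whole row:
    print(table["user:001"]["profile"]["name"])        # Smith
    # Column values of the same family can be stored contiguously on disk,
    # which is what makes read-mostly access patterns cheap.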
Column-family data model
50 Valeria Cardellini - SDCC 2017/18
Column-family data model
51
- Store and process data by column instead of by row
– Can access the needed data faster, rather than scanning and discarding unwanted data in a row – But the primary key is the data itself (see the example below)
…;Smith:001;Jones:002,004;Johnson:003;…
Valeria Cardellini - SDCC 2017/18
Column-family data model
52
- In many queries, few attributes are needed
– Column values are stored contiguously on disk: reduces I/O
- Both rows and columns are split over multiple nodes to
achieve scalability
- So column-family data stores are suitable for read-
mostly, read-intensive, large data repositories
Valeria Cardellini - SDCC 2017/18
Column-family data stores
- Google’s Bigtable is the most notable example
– Built on GFS and the Chubby lock service – Data storage organized in tables, whose rows are distributed over GFS
– Available as Cloud Bigtable on Google Cloud Platform
- Other column-family stores:
– Apache HBase
- Open-source implementation of Bigtable on top of Hadoop and
HDFS
– Cassandra – Amazon Redshift
53 Valeria Cardellini - SDCC 2017/18
Document data model
- Strongly aggregate-oriented
– Lots of aggregates – Each aggregate has a key
- Similar to a key-value store (unique key), but API or
query/update language to query or update based on the internal structure in the document
– The document content is no longer opaque
- Similar to a column-family store, but values can have
complex documents, instead of fixed format
- Document: encapsulates and encodes data in some
standard formats or encodings
– XML, JSON, BSON (binary JSON), …
54 Valeria Cardellini - SDCC 2017/18
Document data model
- Data model:
– A set of <key, document> pairs – Document: an aggregate instance
- Structure of the aggregate is
visible
– Limits on what we can place in it
- Access to an aggregate
– Queries based on the fields in the aggregate
- Flexible schema
– No strict schema to which documents must conform, which eliminates the need of schema migration efforts
55 Valeria Cardellini - SDCC 2017/18
Document data stores
- MongoDB and CouchDB are the two major
representatives
– Documents grouped together to form collections – Collections organized into databases
- Other document stores:
– Couchbase – Azure Cosmos DB as a Cloud service on the Azure Cloud platform
Valeria Cardellini - SDCC 2017/18 56
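A short example of querying on the internal structure of documents with MongoDB's Python driver (pymongo); it assumes a MongoDB instance reachable on localhost, and the database and collection names are illustrative:

    # Document store example with pymongo: the document content is not opaque,
    # so we can query and update by fields inside it.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    books = client["library"]["books"]                 # collections grouped into databases

    books.insert_one({
        "title": "Distributed Systems",
        "authors": ["Tanenbaum", "van Steen"],
        "year": 2017,
        "tags": ["textbook", "distributed"],           # flexible schema: fields may differ per document
    })

    # Query by a field inside the document (impossible in a pure key-value store):
    for doc in books.find({"authors": "Tanenbaum", "year": {"$gte": 2010}}):
        print(doc["title"])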
Graph data model
- Uses graph structures with nodes, edges, and
properties to represent stored data
– Nodes are the entities and have a set of attributes – Edges are the relationships between the entities
- E.g.: an author writes a book
– Nodes and edges also have individual properties
- Powerful data model
– Unlike other types of NoSQL stores, it concerns itself with relationships – Focus on visual representation of information (more human-friendly than other NoSQL stores) – Other types of NoSQL stores are poor for interconnected data
57 Valeria Cardellini - SDCC 2017/18
Graph data model: example
- A network of programmers
58 Valeria Cardellini - SDCC 2017/18
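The programmers-network idea can be sketched as nodes with attributes and labeled edges with their own properties; a plain in-memory representation for illustration (graph databases such as Neo4j expose this through a query language instead):

    # Graph data model: nodes (entities with attributes) and edges (relationships
    # with their own properties).

    nodes = {
        "alice": {"label": "Programmer", "language": "Go"},
        "bob":   {"label": "Programmer", "language": "Java"},
        "p1":    {"label": "Project", "name": "storage-engine"},
    }
    edges = [
        ("alice", "KNOWS",          "bob", {"since": 2015}),
        ("alice", "CONTRIBUTES_TO", "p1",  {"commits": 120}),
        ("bob",   "CONTRIBUTES_TO", "p1",  {"commits": 45}),
    ]

    # Traverse relationships: who works on the same projects as alice?
    alice_projects = {dst for src, rel, dst, _ in edges
                      if src == "alice" and rel == "CONTRIBUTES_TO"}
    coworkers = {src for src, rel, dst, _ in edges
                 if rel == "CONTRIBUTES_TO" and dst in alice_projects and src != "alice"}
    print(coworkers)   # {'bob'}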
Graph databases
- Explicit graph structure
- Major representatives:
– Neo4j – OrientDB
- Cons:
– Sharding: data partitioning is difficult – Horizontal scalability
- When related nodes are stored on different servers, traversing
multiple servers is not performance-efficient
– Require rewiring your brain
59 Valeria Cardellini - SDCC 2017/18
Takeaways
- Don’t get confused by many data models
- No single solution is the best in absolute terms
– The choice depends on app and workload characteristics – You can even use multiple data stores for different tasks of the same app
- Polyglot data persistence: use different data storage solution
for varying needs
Valeria Cardellini - SDCC 2017/18 60
Case study: Amazon’s Dynamo
- Highly available and scalable distributed key-value
data store built for Amazon’s platform
– A very diverse set of Amazon applications with different storage requirements – Need for storage technologies that are always available on a commodity hardware infrastructure
- E.g., shopping cart service: “Customers should be able to view
and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados”
– Meet stringent Service Level Agreements (SLAs)
- E.g., “service guaranteeing that it will provide a response within
300ms for 99.9% of its requests for a peak client load of 500 requests per second.”
61
- G. DeCandia et al., "Dynamo: Amazon's highly available key-value store",
- Proc. of ACM SOSP 2007.
Valeria Cardellini - SDCC 2017/18
Dynamo features
- Simple key-value API
– Simple operations to read (get) and write (put) objects uniquely identified by a key – Each operation involves only one object at a time
- Focus on eventually consistent store
– Sacrifices consistency for availability – BASE rather than ACID
- Efficient usage of resources
- Simple scale-out scheme to manage increasing data
set or request rates
- Internal use of Dynamo
– Security is not an issue since operation environment is assumed to be non-hostile
62 Valeria Cardellini - SDCC 2017/18
Dynamo design principles
- Sacrifice consistency for availability (CAP theorem)
- Use optimistic replication techniques
- Possible conflicting changes which must be detected
and resolved: when to resolve them and who resolves them?
– When: execute conflict resolution during reads rather than writes, i.e. “always writeable” data store – Who: data store or application; if data store, use simple policy (e.g., “last write wins”)
- Other key principles:
– Incremental scalability
- Scale-out with minimal impact on the system
– Symmetry and decentralization
- P2P techniques
– Heterogeneity
63 Valeria Cardellini - SDCC 2017/18
Dynamo API
- Each stored object has an associated key
- Simple API including get() and put() operations to
read and write objects
get(key)
- Returns single object or list of objects with conflicting versions
and context
- Conflicts are handled on reads, never reject a write
put(key, context, object)
- Determines where the replicas of the object should be placed
based on the associated key, and writes the replicas to disk
- Context encodes system metadata, e.g., version number
– Both key and object treated as opaque array of bytes – Key: 128-bit MD5 hash applied to client supplied key
64 Valeria Cardellini - SDCC 2017/18
Techniques used in Dynamo
65
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
Valeria Cardellini - SDCC 2017/18
Data partitioning in Dynamo
- Consistent hashing: output range of a hash is treated
as a ring (similar to Chord)
– MD5(key) -> node (position on the ring) – Unlike Chord: zero-hop DHT
- “Virtual nodes”
– Each node can be responsible for more than one virtual node – Work distribution proportional to the capabilities of the individual node
66 Valeria Cardellini - SDCC 2017/18
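A minimal consistent-hashing ring with virtual nodes, in the spirit of Dynamo's partitioning scheme; the hash function and the number of virtual nodes per node are arbitrary illustrative choices:

    # Consistent hashing ring with virtual nodes: each physical node owns several
    # positions on the ring, and a key is assigned to the first node clockwise
    # from its hash.
    import bisect
    import hashlib

    def ring_hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=8):
            self._ring = sorted((ring_hash(f"{n}#{i}"), n)
                                for n in nodes for i in range(vnodes))
            self._keys = [h for h, _ in self._ring]

        def node_for(self, key: str) -> str:
            idx = bisect.bisect_right(self._keys, ring_hash(key)) % len(self._ring)
            return self._ring[idx][1]              # first virtual node clockwise

    ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
    for k in ("cart:alice", "cart:bob", "cart:carol"):
        print(k, "->", ring.node_for(k))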
Replication in Dynamo
- Each object is replicated on N nodes
– N is a parameter configured per-instance by the application
- Preference list: list of nodes that is responsible for
storing a particular key
– More than N nodes to account for node failures – See figure: object identified by key K is replicated on nodes B, C and D
67
- Node D will store the keys in the
ranges (A, B], (B, C], and (C, D]
Valeria Cardellini - SDCC 2017/18
Data versioning in Dynamo
- A put() call may return to its caller before the
update has been applied at all the replicas
- A get() call operation may return an object that
does not have the latest updates
- Version branching can also happen due to node/
network failures
- Problem: multiple versions of an object, that the
system needs to reconcile
- Solution: use vector clocks to capture the causality among different versions of the same object
– If causal: older version can be forgotten – If concurrent: conflict exists, requiring reconciliation
69 Valeria Cardellini - SDCC 2017/18
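A small vector-clock sketch showing how Dynamo-style versioning distinguishes causally ordered versions (the older one can be forgotten) from concurrent ones (which require reconciliation); node names are illustrative:

    # Vector clocks: one counter per node that handled a write. If one clock
    # dominates the other, the versions are causally ordered; otherwise they are
    # concurrent and need reconciliation.

    def dominates(vc_a: dict, vc_b: dict) -> bool:
        """True if vc_a has seen every event in vc_b (vc_a >= vc_b component-wise)."""
        return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

    def compare(vc_a, vc_b):
        if dominates(vc_a, vc_b):
            return "a supersedes b (b can be forgotten)"
        if dominates(vc_b, vc_a):
            return "b supersedes a (a can be forgotten)"
        return "concurrent: conflict, requires reconciliation"

    v1 = {"Sx": 2, "Sy": 1}           # written via nodes Sx and Sy
    v2 = {"Sx": 2, "Sy": 1, "Sz": 1}  # a later write through Sz: causally after v1
    v3 = {"Sx": 3, "Sy": 1}           # an independent write through Sx: concurrent with v2
    print(compare(v2, v1))
    print(compare(v2, v3))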
Sloppy quorum in Dynamo
- R/W: minimum number of nodes that must participate
in a successful read/write operation
- Setting R + W > N yields a quorum-like system
– The latency of a get() or put() operation is dictated by the slowest of the R or W replicas – R and W are usually configured to be less than N, to provide better latency – Typical configuration in Dynamo: (N, R, W) = (3, 2, 2)
- Balances performance, durability, and availability
- Sloppy quorum
– Due to partitions, strict quorums might not exist – Sloppy quorum: create transient replicas on the first N healthy nodes from the preference list (which may not always be the first N nodes encountered while walking the consistent hashing ring)
71 Valeria Cardellini - SDCC 2017/18
Put and get operations
- put
– Coordinator generates a new vector clock and writes the new version locally – Sends it to the N nodes – Waits for responses from W nodes
- get
– Coordinator requests existing versions from N
- Wait for response from R nodes
– If multiple versions, return all versions that are causally unrelated – Divergent versions are then reconciled – Reconciled version written back
72 Valeria Cardellini - SDCC 2017/18
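A highly simplified coordinator for the put path with (N, R, W) = (3, 2, 2): the write is sent to the first N replicas of the preference list and succeeds once W of them acknowledge. The replica objects are illustrative stand-ins, not Dynamo's real node interface:

    # Sketch of a quorum write with (N, R, W) = (3, 2, 2): the coordinator sends
    # the new version to the first N replicas and reports success once W
    # acknowledgements arrive. R is used symmetrically on the read path.

    N, R, W = 3, 2, 2

    class Replica:
        def __init__(self, name, alive=True):
            self.name, self.alive, self.store = name, alive, {}

        def put(self, key, value, version):
            if not self.alive:
                return False                      # failed node: no acknowledgement
            self.store[key] = (value, version)
            return True

    def coordinator_put(preference_list, key, value, version):
        acks = sum(r.put(key, value, version) for r in preference_list[:N])
        return acks >= W                          # success only with a write quorum

    replicas = [Replica("A"), Replica("B", alive=False), Replica("C")]
    ok = coordinator_put(replicas, "cart:alice", ["book"], version={"A": 1})
    print(ok)   # True: 2 acks out of 3 satisfy W = 2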
Hinted handoff in Dynamo
- Consider N = 3; if A is
temporarily down or unreachable, put will use D
- D knows that the replica
belongs to A
- Later, D detects A is alive
– Sends the replica to A – Removes the replica
73
- Hinted handoff for transient failures
- Again, “always writeable” principle
Valeria Cardellini - SDCC 2017/18
Membership management
- Administrator explicitly adds and removes nodes
- Gossiping to propagate membership changes
– Eventually consistent view – O(1) hop overlay
75 Valeria Cardellini - SDCC 2017/18
Case study: Memcached
- Free and open-source key-value data store, used as a caching layer by Flickr, Twitter, Wikipedia, YouTube
- High-performance, distributed
memory object caching system
– Generic in nature, but intended for use in speeding up dynamic web applications by alleviating load on data layer
- Provides in-memory key-value store
for small chunks of arbitrary data (strings, objects) from results of database queries, API calls, or page rendering
- API available for most languages
- Available on AWS as ElastiCache
76 Valeria Cardellini - SDCC 2017/18
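Typical cache-aside usage with the python-memcached client; it assumes a memcached daemon on localhost:11211, and query_database is a placeholder for the real data layer:

    # Cache-aside pattern with memcached: try the cache first, fall back to the
    # database and populate the cache on a miss.
    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    def query_database(user_id):
        return {"id": user_id, "name": "Alice"}   # placeholder for a real (slow) DB query

    def get_user(user_id):
        key = f"user:{user_id}"
        user = mc.get(key)
        if user is None:                          # cache miss: go to the data layer
            user = query_database(user_id)
            mc.set(key, user, time=300)           # cache for 5 minutes
        return user

    print(get_user(42))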
Case study: AWS DynamoDB
- DynamoDB: NoSQL document and key-value data store
fully managed by AWS
– Global service (no choice of AWS region)
- Goal: fast and predictable performance with seamless
scalability
77
- AWS services for the data tier
Valeria Cardellini - SDCC 2017/18
Case study: DynamoDB
- Consistency model
– Eventually consistent reads (default)
- Maximizes read throughput
– Strongly consistent reads
- Durability
– Writes continuously replicated to 3 AWS availability zones – Quorum acknowledgment – Persisted on disk
- Automatic partitioning
– Automatically spreads the table data and traffic over multiple servers to handle the request capacity specified by the customer and the amount of data stored – Partitions and re-partitions the data as the table size grows
- How to use it
– AWS Management Console or Amazon DynamoDB APIs
78 Valeria Cardellini - SDCC 2017/18
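A minimal example of using the DynamoDB API through boto3; the region, table name, and key schema are illustrative, and the 'short-urls' table (with partition key 'short_id') is assumed to already exist:

    # Minimal DynamoDB usage via boto3 (one of the "Amazon DynamoDB APIs" above).
    # Assumes AWS credentials are configured.
    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
    table = dynamodb.Table("short-urls")

    # Write an item (schema-less attributes besides the primary key)
    table.put_item(Item={"short_id": "abc123",
                         "long_url": "https://example.org/some/long/path",
                         "hits": 0})

    # Reads are eventually consistent by default; set ConsistentRead=True
    # for a strongly consistent read.
    resp = table.get_item(Key={"short_id": "abc123"}, ConsistentRead=False)
    print(resp.get("Item"))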
DynamoDB: example app
- App: scalable URL shortener
- Architecture
Valeria Cardellini - SDCC 2017/18 79
DynamoDB: example app
- Architecture
Valeria Cardellini - SDCC 2017/18 80
DynamoDB: example app
Valeria Cardellini - SDCC 2017/18 81
Primary key Attributes (schema-less)
- For full example, see http://bit.ly/2C4pAv0