SLIDE 1 Data-Intensive Distributed Computing
Part 7: Mutable State (1/2)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
CS 431/631 451/651 (Winter 2019) Adam Roegiest
Kira Systems
March 14, 2019
These slides are available at http://roegiest.com/bigdata-2019w/
SLIDE 2
Structure of the Course
“Core” framework features and algorithm design
Analyzing Text
Analyzing Graphs
Analyzing Relational Data
Data Mining
SLIDE 3
The Fundamental Problem
We want to keep track of mutable state in a scalable manner. MapReduce won't do!
Assumptions:
State organized in terms of logical records
State unlikely to fit on a single machine, must be distributed
Want more? Take a real distributed systems course!
SLIDE 4
The Fundamental Problem
We want to keep track of mutable state in a scalable manner. Assumptions:
State organized in terms of logical records
State unlikely to fit on a single machine, must be distributed
SLIDE 5
What do RDBMSes provide?
Relational model with schemas
Powerful, flexible query language
Transactional semantics: ACID
Rich ecosystem, lots of tool support
SLIDE 6 Source: www.flickr.com/photos/spencerdahl/6075142688/
RDBMSes: Pain Points
SLIDE 7
#1: Must design up front, painful to evolve
SLIDE 8
{ "token": 945842, "feature_enabled": "super_special", "userid": 229922, "page": "null", "info": { "email": "my@place.com" } }
Is this really an integer? Is this really null? This should really be a list… Flexible design doesn’t mean no design! What keys? What values? Consistent field names?
JSON to the Rescue!
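For illustration only, here is one possible cleanup of the record above (the field names and types here are assumptions, not a prescription): the numeric fields stay integers, page becomes a real null, and the feature flags become a list.

{
  "token": 945842,
  "features_enabled": ["super_special"],
  "userid": 229922,
  "page": null,
  "info": { "email": "my@place.com" }
}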
SLIDE 9 Source: Wikipedia (Tortoise)
#2: Pay for ACID!
SLIDE 10 #3: Cost!
Source: www.flickr.com/photos/gnusinn/3080378658/
SLIDE 11 What do RDBMSes provide?
Relational model with schemas
Powerful, flexible query language
Transactional semantics: ACID
Rich ecosystem, lots of tool support
What if we want a la carte?
Source: www.flickr.com/photos/vidiot/18556565/
SLIDE 12
Features a la carte?
What if I'm willing to give up consistency for scalability?
What if I'm willing to give up the relational model for flexibility?
What if I just want a cheaper solution?
Enter… NoSQL!
SLIDE 13 Source: geekandpoke.typepad.com/geekandpoke/2011/01/nosql.html
SLIDE 14 NoSQL
Source: Cattell (2010). Scalable SQL and NoSQL Data Stores. SIGMOD Record.
(Not only SQL)
1. Horizontally scale "simple operations"
2. Replicate/distribute data over many servers
3. Simple call interface
4. Weaker concurrency model than ACID
5. Efficient use of distributed indexes and RAM
6. Flexible schemas
SLIDE 15
“web scale”
SLIDE 16
(Major) Types of NoSQL databases
Key-value stores
Column-oriented databases
Document stores
Graph databases
SLIDE 17
Three Core Ideas
Partitioning (sharding)
To increase scalability and to decrease latency
Caching
To reduce latency
Replication
To increase robustness (availability) and to increase throughput
SLIDE 18 Source: Wikipedia (Keychain)
Key-Value Stores
SLIDE 19
Key-Value Stores: Data Model
Stores associations between keys and values
Values can be primitive or complex: often opaque to the store
Primitives: ints, strings, etc. Complex: JSON, HTML fragments, etc.
Keys are usually primitives
For example, ints, strings, raw bytes, etc.
SLIDE 20
Key-Value Stores: Operations
Very simple API:
Get – fetch the value associated with a key
Put – set the value associated with a key
Optional operations:
Multi-get
Multi-put
Range queries
Secondary index lookups
Consistency model:
Atomic single-record operations (usually)
Cross-key operations: who knows?
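As a minimal sketch of this API (all names assumed; this is not any particular product's client), a key-value store is little more than a big hash table:

class KeyValueStore:
    # Toy single-machine stand-in for a key-value store.
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        # Set the value associated with a key; atomic for this single record.
        self.data[key] = value

    def get(self, key):
        # Fetch the value associated with a key (None if absent).
        return self.data.get(key)

    def multi_get(self, keys):
        # One of the optional operations: fetch several keys in one call.
        return {k: self.data.get(k) for k in keys}

Everything interesting happens once self.data no longer fits on one machine, which is where the rest of this part picks up.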
SLIDE 21
Key-Value Stores: Implementation
Non-persistent:
Just a big in-memory hash table
Examples: Redis, memcached
Persistent:
Wrapper around a traditional RDBMS
Example: Voldemort
What if data doesn’t fit on a single machine?
SLIDE 22
Simple Solution: Partition!
Partition the key space across multiple machines
Let's say, hash partitioning: for n machines, store key k at machine h(k) mod n
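A minimal sketch of this scheme (the function name and the choice of MD5 are assumptions):

import hashlib

def machine_for(key, n):
    # Hash partitioning: store key k at machine h(k) mod n.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

Note that going from n to n + 1 machines changes h(k) mod n for almost every key, so nearly all data would have to move; that caveat previews the questions below.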
Okay… But:
How do we know which physical machine to contact?
How do we add a new machine to the cluster?
What happens if a machine fails?
SLIDE 23 Clever Solution
Hash the keys
Hash the machines too!
Distributed hash tables!
(the following combines ideas from several sources…)
SLIDE 24 [Diagram: hash ring spanning h = 0 to h = 2^n – 1]
SLIDE 25 [Diagram: hash ring spanning h = 0 to h = 2^n – 1, with keys and machines hashed onto it]
Routing: Which machine holds the key?
Each machine holds pointers to its predecessor and successor
Send a request to any node and it gets routed to the correct one in O(n) hops
Can we do better?
SLIDE 26 [Diagram: hash ring spanning h = 0 to h = 2^n – 1]
Routing: Which machine holds the key?
Each machine holds pointers to its predecessor and successor, plus a "finger table" (+2, +4, +8, …)
Send a request to any node and it gets routed to the correct one in O(log n) hops
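To make the ring concrete, here is a sketch of consistent hashing with a global view of all machines (names and the MD5-based hash are assumptions; a real DHT like Chord reaches the same successor using only per-node pointers and finger tables, in O(log n) hops):

import bisect
import hashlib

def h(s, bits=32):
    # Hash a string onto the ring [0, 2^bits - 1].
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** bits)

class Ring:
    def __init__(self, machines):
        # Hash the machines onto the same ring as the keys.
        self.points = sorted((h(m), m) for m in machines)
        self.hashes = [p for p, _ in self.points]

    def lookup(self, key):
        # A key belongs to its successor: the first machine at or after
        # the key's hash, wrapping around at the top of the ring.
        i = bisect.bisect_left(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

For example, Ring(["m01", "m02", "m03"]).lookup("user:229922") returns the machine owning that key's range.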
SLIDE 27 [Diagram: hash ring spanning h = 0 to h = 2^n – 1]
Routing: Which machine holds the key?
Simpler solution: a service registry
SLIDE 28 [Diagram: hash ring spanning h = 0 to h = 2^n – 1, with a new machine joining]
New machine joins: What happens?
How do we rebuild the predecessor, successor, finger tables?
Stoica et al. (2001). Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. SIGCOMM.
SLIDE 29 [Diagram: hash ring spanning h = 0 to h = 2^n – 1, with a machine failing]
Machine fails: What happens?
Solution: Replication
N = 3: replicate to the +1 and –1 neighbors on the ring
[Diagram: the failed machine's key range remains covered by its neighboring replicas]
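A sketch of how the replica set might be computed on such a ring (points is the sorted (hash, machine) list from the earlier sketch; the indexing scheme is an assumption):

def replica_set(points, i):
    # N = 3 as on the slide: the owner of the key range plus its
    # +1 successor and -1 predecessor on the ring (wrapping around).
    n = len(points)
    return [points[i % n][1], points[(i + 1) % n][1], points[(i - 1) % n][1]]

If the owner fails, its key range is still covered by its two neighbors.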
SLIDE 30
Three Core Ideas
Partitioning (sharding)
To increase scalability and to decrease latency
Caching
To reduce latency
Replication
To increase robustness (availability) and to increase throughput
SLIDE 31
Another Refinement: Virtual Nodes
Don't hash servers directly
Create a large number of virtual nodes and map them to physical servers
Better load redistribution in the event of machine failure
When a new server joins, it takes load evenly from the other servers
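A sketch of the virtual-node refinement (the vnode naming and count are assumptions; h is the same illustrative hash as before):

import bisect
import hashlib

def h(s, bits=32):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** bits)

class VirtualRing:
    def __init__(self, servers, vnodes=100):
        # Hash many named virtual nodes per physical server, not the server itself.
        self.points = sorted((h("%s#%d" % (s, i)), s)
                             for s in servers for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def lookup(self, key):
        i = bisect.bisect_left(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

Because each physical server owns many small, scattered ranges, a failed or newly joined server redistributes load in small pieces across all the others rather than dumping one large range on a single neighbor.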
SLIDE 32 Source: Wikipedia (Table)
Bigtable
SLIDE 33
Bigtable Applications
Gmail
Google's web crawl
Google Earth
Google Analytics
Data source and data sink for MapReduce
HBase is the open-source implementation…
SLIDE 34 Data Model
A table in Bigtable is a sparse, distributed, persistent, multidimensional sorted map
The map is indexed by a row key, column key, and a timestamp
(row:string, column:string, time:int64) → uninterpreted byte array
Supports lookups, inserts, deletes
Single row transactions only
Image Source: Chang et al., OSDI 2006
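As a toy illustration of the data model (a plain dict standing in for the sorted, distributed map; all names are assumptions):

# (row:string, column:string, time:int64) -> uninterpreted byte array
table = {}

def put(row, column, ts, value):
    table[(row, column, ts)] = value

def get(row, column, ts):
    return table.get((row, column, ts))

put("com.example.www", "contents:", 6, b"<html>...")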
SLIDE 35
Rows and Columns
Rows maintained in sorted lexicographic order
Applications can exploit this property for efficient row scans
Row ranges dynamically partitioned into tablets
Columns grouped into column families
Column key = family:qualifier
Column families provide locality hints
Unbounded number of columns
At the end of the day, it’s all key-value pairs!
SLIDE 36
Key-Values
(row, column family, column qualifier, timestamp) → value
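For example, one cell of the webtable example from Chang et al. flattens to a single key-value pair (the timestamp value here is made up for illustration):

key = ("com.cnn.www", "anchor", "cnnsi.com", 1234567890)  # row, family, qualifier, timestamp
value = b"CNN"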
SLIDE 37
In memory: mutability is easy, but capacity is small
On disk: capacity is big, but mutability is hard
Okay, so how do we build it?
SLIDE 38
Log Structured Merge Trees
[Diagram: writes and reads both go to an in-memory MemStore]
What happens when we run out of memory?
SLIDE 39
Log Structured Merge Trees
[Diagram: the MemStore (in memory) is flushed to disk as an immutable store]
Flush to disk: immutable, indexed, persistent key-value pairs
What happens to the read path?
SLIDE 40
Log Structured Merge Trees
[Diagram: the read path merges results from the MemStore and the on-disk store]
What happens as more writes happen?
SLIDE 41
Log Structured Merge Trees
[Diagram: repeated flushes accumulate multiple immutable stores on disk]
What happens to the read path?
SLIDE 42
Log Structured Merge Trees
[Diagram: the read path merges results from the MemStore and every on-disk store]
What's the next issue?
SLIDE 43
Log Structured Merge Trees
[Diagram: on-disk stores are merged together, reducing their number]
SLIDE 44
Log Structured Merge Trees
[Diagram: merging leaves a single on-disk store]
SLIDE 45
Log Structured Merge Trees
[Diagram: writes are also appended to a write-ahead log (WAL) on disk]
One final component: logging for persistence (the WAL)
SLIDE 46
Log Structured Merge Trees
The complete picture:
[Diagram: writes go to the WAL and the MemStore; the MemStore flushes to immutable on-disk stores; compaction merges the stores; reads merge results from the MemStore and the stores]
Compaction!
SLIDE 47
Log Structured Merge Trees
The complete picture… Okay, now how do we build a distributed version?
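Before turning to the distributed version, here is a minimal single-machine sketch of the complete picture above (all names and the flush threshold are assumptions):

class LSMTree:
    def __init__(self, flush_limit=4):
        self.wal = []        # write-ahead log: logging for persistence
        self.memstore = {}   # mutable, in memory
        self.stores = []     # immutable, sorted runs "on disk", newest last
        self.flush_limit = flush_limit

    def put(self, key, value):
        self.wal.append((key, value))   # log first, so a crash loses nothing
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_limit:
            self.flush()

    def flush(self):
        # Flush to disk: freeze the MemStore into an immutable sorted store.
        self.stores.append(sorted(self.memstore.items()))
        self.memstore, self.wal = {}, []

    def get(self, key):
        # Read path: check the MemStore, then the stores from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for store in reversed(self.stores):
            for k, v in store:
                if k == key:
                    return v
        return None

Compaction, which merges the runs in self.stores, is sketched separately under the compaction slide later in this part.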
SLIDE 48
Bigtable building blocks
GFS SSTable Tablet Tablet Server Chubby
SLIDE 49
SSTable
Persistent, ordered, immutable map from keys to values
Stored in GFS: replication “for free”
Supported operations:
Look up the value associated with a key
Iterate over key-value pairs within a key range
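A sketch of those two operations over an in-memory stand-in for an SSTable (binary search over the sorted keys; all names are assumptions):

import bisect

class SSTable:
    def __init__(self, pairs):
        # Persistent, ordered, immutable: modeled here as a frozen sorted list.
        self.pairs = sorted(pairs)
        self.keys = [k for k, _ in self.pairs]

    def lookup(self, key):
        # Look up the value associated with a key.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.pairs[i][1]
        return None

    def scan(self, lo, hi):
        # Iterate over key-value pairs within the key range [lo, hi).
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return self.pairs[i:j]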
SLIDE 50
Tablet
Dynamically partitioned range of rows
Comprised of multiple SSTables
[Diagram: a tablet covering the row range aardvark - base, comprised of several SSTables]
SLIDE 51
Tablet Server
[Diagram: the same log-structured design as before, with SSTables as the on-disk stores: writes go to the WAL and the MemStore, the MemStore flushes to immutable SSTables, and compaction merges SSTables]
SLIDE 52
Table
Comprised of multiple tablets
SSTables can be shared between tablets
[Diagram: two tablets, covering row ranges aardvark - base and basic - database, sharing an SSTable]
SLIDE 53
Tablet to Tablet Server Assignment
Each tablet is assigned to one tablet server at a time
That server exclusively handles read and write requests to that tablet
(HBase calls the tablet server a Region Server)
What happens when a tablet grows too big?
What happens when a tablet server fails? We need a lock service!
SLIDE 54
Bigtable building blocks
GFS SSTable Tablet Tablet Server Chubby
SLIDE 55
Architecture
Client library
Bigtable master
Tablet servers
SLIDE 56
Bigtable Master
Roles and responsibilities:
Assigns tablets to tablet servers
Detects the addition and removal of tablet servers
Balances tablet server load
Handles garbage collection
Handles schema changes
Tablet structure changes:
Table creation/deletion (master initiated)
Tablet merging (master initiated)
Tablet splitting (tablet server initiated)
SLIDE 57
Compactions
Minor compaction:
Converts the memtable into an SSTable
Reduces memory usage and log traffic on restart
Merging compaction:
Reads a few SSTables and the memtable, and writes out a new SSTable
Reduces the number of SSTables
Major compaction:
A merging compaction that results in only one SSTable
No deletion records, only live data
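A sketch of a merging compaction (the tombstone marker and the newest-first ordering are assumptions): it merges a few sorted runs, which may include the memtable's sorted contents, into one new run.

TOMBSTONE = object()   # assumed marker for a deletion record

def merge_compact(runs, major=False):
    # runs: sorted (key, value) lists, ordered newest first.
    merged = {}
    for run in runs:
        for k, v in run:
            merged.setdefault(k, v)   # first (newest) occurrence of a key wins
    if major:
        # Major compaction: drop deletion records so only live data remains.
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(merged.items())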
SLIDE 58
Table
Comprised of multiple tablets
SSTables can be shared between tablets
[Diagram: two tablets, covering row ranges aardvark - base and basic - database, sharing an SSTable]
SLIDE 59
Three Core Ideas
Partitioning (sharding)
To increase scalability and to decrease latency
Caching
To reduce latency
Replication
To increase robustness (availability) and to increase throughput
SLIDE 60 Image Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
HBase
SLIDE 61 Source: Wikipedia (Japanese rock garden)