Data-Intensive Distributed Computing CS 431/631 451/651 (Winter 2019)



SLIDE 1

Data-Intensive Distributed Computing
CS 431/631 451/651 (Winter 2019)

Part 7: Mutable State (1/2)

Adam Roegiest
Kira Systems
March 14, 2019

These slides are available at http://roegiest.com/bigdata-2019w/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

SLIDE 2

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text
Analyzing Graphs
Analyzing Relational Data
Data Mining

SLIDE 3

The Fundamental Problem

We want to keep track of mutable state in a scalable manner
MapReduce won’t do!

Assumptions:
State organized in terms of logical records
State unlikely to fit on single machine, must be distributed

Want more? Take a real distributed systems course!

SLIDE 4

The Fundamental Problem

We want to keep track of mutable state in a scalable manner

Assumptions:
State organized in terms of logical records
State unlikely to fit on single machine, must be distributed

SLIDE 5

What do RDBMSes provide?

Relational model with schemas
Powerful, flexible query language
Transactional semantics: ACID
Rich ecosystem, lots of tool support

SLIDE 6

RDBMSes: Pain Points

Source: www.flickr.com/photos/spencerdahl/6075142688/

SLIDE 7

#1: Must design up front, painful to evolve

SLIDE 8

JSON to the Rescue!

{
  "token": 945842,
  "feature_enabled": "super_special",
  "userid": 229922,
  "page": "null",
  "info": { "email": "my@place.com" }
}

Is this really an integer? Is this really null? This should really be a list…
Flexible design doesn’t mean no design! What keys? What values? Consistent field names?
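To make the slide’s annotations concrete, here is a hedged Python sketch: the record is the one on the slide, and the cleaned-up version reflects one plausible reading of the annotations (the corrected names and types are assumptions, not a canonical schema).

```python
import json

# the record from the slide: a string where null was meant,
# and a scalar where the annotation suggests a list
sloppy = json.loads("""
{ "token": 945842, "feature_enabled": "super_special",
  "userid": 229922, "page": "null",
  "info": { "email": "my@place.com" } }
""")

# the shape the slide's annotations suggest (assumed, not canonical):
clean = {
    "token": "945842",                      # opaque token, not an integer
    "features_enabled": ["super_special"],  # "this should really be a list"
    "userid": 229922,
    "page": None,                           # a real null, not the string "null"
    "info": {"email": "my@place.com"},
}
```

The point of the slide stands either way: a schemaless format still needs a deliberate design for keys, values, and field names.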

SLIDE 9

#2: Pay for ACID!

Source: Wikipedia (Tortoise)

SLIDE 10

#3: Cost!

Source: www.flickr.com/photos/gnusinn/3080378658/

SLIDE 11

What do RDBMSes provide?

Relational model with schemas
Powerful, flexible query language
Transactional semantics: ACID
Rich ecosystem, lots of tool support

What if we want a la carte?

Source: www.flickr.com/photos/vidiot/18556565/

SLIDE 12

Features a la carte?

What if I’m willing to give up consistency for scalability?
What if I’m willing to give up the relational model for flexibility?
What if I just want a cheaper solution?

Enter… NoSQL!

SLIDE 13

Source: geekandpoke.typepad.com/geekandpoke/2011/01/nosql.html

SLIDE 14

NoSQL
(Not only SQL)

1. Horizontally scale “simple operations”
2. Replicate/distribute data over many servers
3. Simple call interface
4. Weaker concurrency model than ACID
5. Efficient use of distributed indexes and RAM
6. Flexible schemas

Source: Cattell (2010). Scalable SQL and NoSQL Data Stores. SIGMOD Record.
SLIDE 15

“web scale”

SLIDE 16

(Major) Types of NoSQL databases

Key-value stores
Column-oriented databases
Document stores
Graph databases

SLIDE 17

Three Core Ideas

Partitioning (sharding)

To increase scalability and to decrease latency

Caching

To reduce latency

Replication

To increase robustness (availability) and to increase throughput

SLIDE 18

Key-Value Stores

Source: Wikipedia (Keychain)

SLIDE 19

Key-Value Stores: Data Model

Stores associations between keys and values
Values can be primitive or complex: often opaque to store

Primitives: ints, strings, etc.
Complex: JSON, HTML fragments, etc.

Keys are usually primitives

For example, ints, strings, raw bytes, etc.

SLIDE 20

Key-Value Stores: Operations

Very simple API:
Get – fetch value associated with key
Put – set value associated with key

Optional operations:
Multi-get
Multi-put
Range queries
Secondary index lookups

Consistency model:
Atomic single-record operations (usually)
Cross-key operations: who knows?
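A minimal sketch of the API described above, using a plain in-memory dict; the class and method names are illustrative, not taken from any particular store.

```python
class KVStore:
    """Sketch of a key-value store: get/put plus the optional operations."""

    def __init__(self):
        self._data = {}              # in-memory backing store

    def put(self, key, value):
        # set value associated with key
        self._data[key] = value

    def get(self, key):
        # fetch value associated with key (None if absent)
        return self._data.get(key)

    def multi_get(self, keys):
        return {k: self._data.get(k) for k in keys}

    def multi_put(self, pairs):
        self._data.update(pairs)

    def range_query(self, lo, hi):
        # only meaningful when keys are ordered; returns keys in [lo, hi)
        return {k: v for k, v in sorted(self._data.items()) if lo <= k < hi}
```

Range queries only make sense when the store keeps keys ordered, which is one reason many key-value stores treat them as optional.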

SLIDE 21

Key-Value Stores: Implementation

Non-persistent:
Just a big in-memory hash table
Examples: Redis, memcached

Persistent:
Wrapper around a traditional RDBMS
Example: Voldemort

What if data doesn’t fit on a single machine?

SLIDE 22

Simple Solution: Partition!

Partition the key space across multiple machines
Let’s say, hash partitioning
For n machines, store key k at machine h(k) mod n

Okay… But:
How do we know which physical machine to contact?
How do we add a new machine to the cluster?
What happens if a machine fails?
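The “how do we add a new machine?” problem can be shown directly. A sketch of mod-n hash partitioning (MD5 is an arbitrary choice of hash here):

```python
import hashlib

def machine_for(key, n):
    # hash partitioning: key k lives on machine h(k) mod n
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user{i}" for i in range(1000)]
before = {k: machine_for(k, 4) for k in keys}
after = {k: machine_for(k, 5) for k in keys}   # cluster grows by one machine
moved = sum(1 for k in keys if before[k] != after[k])
# with mod-n hashing, roughly (n-1)/n of all keys relocate when n changes
```

Because h(k) mod n changes for almost every key when n changes, growing the cluster by one machine relocates roughly (n−1)/n of the data, which motivates the cleverer solution on the next slide.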

SLIDE 23

Clever Solution

Hash the keys
Hash the machines also!
Distributed hash tables!

(following combines ideas from several sources…)

SLIDE 24

[Ring diagram: hash space from h = 0 to h = 2^n – 1]

SLIDE 25

[Ring diagram: hash space from h = 0 to h = 2^n – 1]

Routing: Which machine holds the key?

Each machine holds pointers to predecessor and successor
Send request to any node, gets routed to correct one in O(n) hops

Can we do better?

SLIDE 26

[Ring diagram: hash space from h = 0 to h = 2^n – 1]

Routing: Which machine holds the key?

Each machine holds pointers to predecessor and successor, plus a “finger table” (+2, +4, +8, …)
Send request to any node, gets routed to correct one in O(log n) hops
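The ring can be sketched locally, assuming a 32-bit hash space: both machines and keys are hashed, and a key belongs to the first machine at or after its hash position, wrapping around. Here `bisect` performs the lookup in one place; in Chord, the finger table achieves the same O(log n) bound across the network.

```python
import bisect
import hashlib

def h(s):
    # hash keys and machines into the same 32-bit identifier space
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

machines = ["node-a", "node-b", "node-c"]   # illustrative names
ring = sorted((h(m), m) for m in machines)  # machines placed on the ring
points = [p for p, _ in ring]

def lookup(key):
    # key belongs to its successor: first machine clockwise from h(key)
    i = bisect.bisect_right(points, h(key)) % len(ring)
    return ring[i][1]
```

The `% len(ring)` handles wrap-around: a key hashing past the last machine belongs to the first one.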

SLIDE 27

[Ring diagram: hash space from h = 0 to h = 2^n – 1]

Routing: Which machine holds the key?

Simpler solution: Service Registry

SLIDE 28

[Ring diagram: hash space from h = 0 to h = 2^n – 1]

New machine joins: What happens?

How do we rebuild the predecessor, successor, finger tables?

Stoica et al. (2001). Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. SIGCOMM.

Cf. Gossip Protocols
SLIDE 29

[Ring diagram: hash space from h = 0 to h = 2^n – 1]

Machine fails: What happens?

Solution: Replication
N = 3, replicate +1, –1
The failed machine’s key range is covered by its neighbors’ replicas
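The replication rule on this slide can be sketched on the same ring: with N = 3, each key is stored on its primary node plus that node’s successor (+1) and predecessor (−1), so a single machine failure still leaves two live copies. Machine names and the hash function are illustrative.

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

machines = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = sorted((h(m), m) for m in machines)
points = [p for p, _ in ring]

def replicas(key, n=3):
    # primary = successor of h(key); replicate to +1 and -1 on the ring
    i = bisect.bisect_right(points, h(key)) % len(ring)
    return [ring[i][1],                       # primary
            ring[(i + 1) % len(ring)][1],     # successor (+1)
            ring[(i - 1) % len(ring)][1]]     # predecessor (-1)
```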

SLIDE 30

Three Core Ideas

Partitioning (sharding)

To increase scalability and to decrease latency

Caching

To reduce latency

Replication

To increase robustness (availability) and to increase throughput

SLIDE 31

Another Refinement: Virtual Nodes

Don’t directly hash servers
Create a large number of virtual nodes, map to physical servers

Better load redistribution in event of machine failure
When new server joins, evenly shed load from other servers
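A sketch of the idea above: each physical server is hashed onto the ring many times (100 virtual nodes per server here, an arbitrary choice), so keys spread roughly evenly, and a failed server’s load scatters across all survivors rather than dumping onto one neighbor.

```python
import bisect
import hashlib
from collections import Counter

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

VNODES = 100                      # virtual nodes per physical server
servers = ["s1", "s2", "s3", "s4"]

# hash each server many times; each ring point maps back to its server
ring = sorted((h(f"{s}#{v}"), s) for s in servers for v in range(VNODES))
points = [p for p, _ in ring]

def owner(key):
    i = bisect.bisect_right(points, h(key)) % len(ring)
    return ring[i][1]

# with many virtual nodes, load per physical server is roughly balanced
load = Counter(owner(f"key{i}") for i in range(10000))
```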

SLIDE 32

Bigtable

Source: Wikipedia (Table)

SLIDE 33

Bigtable Applications

Gmail
Google’s web crawl
Google Earth
Google Analytics
Data source and data sink for MapReduce

HBase is the open-source implementation…

SLIDE 34

Data Model

A table in Bigtable is a sparse, distributed, persistent multidimensional sorted map
Map indexed by a row key, column key, and a timestamp

(row:string, column:string, time:int64) → uninterpreted byte array

Supports lookups, inserts, deletes

Single row transactions only

Image Source: Chang et al., OSDI 2006
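The data model above can be sketched with a plain dict standing in for the distributed, persistent machinery; the row and column keys follow the web-crawl example in Chang et al.

```python
# (row, column, timestamp) -> uninterpreted byte array
table = {}

def put(row, column, ts, value):
    # single-cell write; Bigtable guarantees single-row transactions only
    table[(row, column, ts)] = value

def get(row, column):
    # return the most recent version stored for this (row, column)
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# keys in the style of the Chang et al. web-crawl example
put("com.cnn.www", "anchor:cnnsi.com", 1, b"CNN")
put("com.cnn.www", "anchor:cnnsi.com", 2, b"CNN.com")
```

A real implementation keeps the map sorted by row key and prunes old timestamps; this sketch only shows the shape of the index.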

SLIDE 35

Rows and Columns

Rows maintained in sorted lexicographic order

Applications can exploit this property for efficient row scans
Row ranges dynamically partitioned into tablets

Columns grouped into column families

Column key = family:qualifier
Column families provide locality hints
Unbounded number of columns

At the end of the day, it’s all key-value pairs!

SLIDE 36

Key-Values

(row, column family, column qualifier, timestamp) → value

SLIDE 37

In memory: mutability easy, but small
On disk: mutability hard, but big

Okay, so how do we build it?

SLIDE 38

Log Structured Merge Trees

[Diagram: writes and reads served by an in-memory MemStore]

What happens when we run out of memory?

SLIDE 39

Log Structured Merge Trees

[Diagram: the full MemStore is flushed to disk as a Store (immutable, indexed, persistent key-value pairs)]

What happens to the read path?

SLIDE 40

Log Structured Merge Trees

[Diagram: as more writes arrive, repeated flushes accumulate Stores on disk, which are merged]

What happens as more writes happen?

SLIDE 41

Log Structured Merge Trees

[Diagram: reads must now consult the MemStore and every Store on disk]

What happens to the read path?

SLIDE 42

Log Structured Merge Trees

[Diagram: MemStore in memory; a growing number of immutable Stores on disk]

What’s the next issue?

SLIDE 43

Log Structured Merge Trees

[Diagram: Stores on disk are merged to reduce their number]

SLIDE 44

Log Structured Merge Trees

[Diagram: after merging, fewer Stores remain on disk]

SLIDE 45

Log Structured Merge Trees

One final component: logging for persistence

[Diagram: writes go to a write-ahead log (WAL) as well as the MemStore; flushes and merges proceed as before]

SLIDE 46

Log Structured Merge Trees

The complete picture…

[Diagram: WAL and MemStore in memory; immutable, indexed, persistent Stores on disk, merged by compaction]

SLIDE 47

Log Structured Merge Trees

The complete picture…

Okay, now how do we build a distributed version?
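The structure built up over these slides can be sketched in a few lines of Python. The WAL is elided, and `MEMSTORE_LIMIT` is artificially small to force flushes; names are illustrative.

```python
MEMSTORE_LIMIT = 2

memstore = {}      # mutable, in memory
stores = []        # immutable sorted stores on disk, newest first

def write(key, value):
    memstore[key] = value
    if len(memstore) >= MEMSTORE_LIMIT:
        flush()

def flush():
    # full MemStore becomes an immutable, sorted on-disk store
    global memstore
    stores.insert(0, dict(sorted(memstore.items())))
    memstore = {}

def read(key):
    # read path: MemStore first, then stores newest-to-oldest
    if key in memstore:
        return memstore[key]
    for store in stores:
        if key in store:
            return store[key]
    return None

def compact():
    # merging compaction: fold all stores into one, newest value wins
    merged = {}
    for store in reversed(stores):   # oldest first, so newer overwrite
        merged.update(store)
    stores[:] = [dict(sorted(merged.items()))]
```

This shows why compaction matters: without it, every read fans out over an ever-growing list of stores.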

SLIDE 48

Bigtable building blocks

GFS
SSTable
Tablet
Tablet Server
Chubby

SLIDE 49

SSTable

Persistent, ordered immutable map from keys to values

Stored in GFS: replication “for free”

Supported operations:

Look up value associated with key
Iterate key/value pairs within a key range
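A sketch of the SSTable interface on this slide: a sorted list plus binary search stands in for the persistent, indexed on-disk format.

```python
import bisect

class SSTable:
    """Ordered, immutable map: point lookup and range iteration."""

    def __init__(self, pairs):
        self._pairs = sorted(pairs)              # immutable once built
        self._keys = [k for k, _ in self._pairs]

    def lookup(self, key):
        # look up value associated with key
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._pairs[i][1]
        return None

    def scan(self, lo, hi):
        # iterate key/value pairs with lo <= key < hi
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return self._pairs[i:j]
```

Because the file is sorted and never modified, both operations need only binary search, and replication comes “for free” by storing the file in GFS.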

SLIDE 50

Tablet

Dynamically partitioned range of rows

Comprised of multiple SSTables

[Diagram: a Tablet covering the row range aardvark to base, comprised of several SSTables]

SLIDE 51

Tablet Server

[Diagram: a tablet server is the LSM structure from before: WAL and MemStore in memory; immutable, indexed, persistent SSTables on disk; flushes and compactions in between]

SLIDE 52

Table

Comprised of multiple tablets

SSTables can be shared between tablets

[Diagram: two Tablets, covering rows aardvark to base and basic to database, sharing an SSTable]

SLIDE 53

Tablet to Tablet Server Assignment

Each tablet is assigned to one tablet server at a time

Exclusively handles read and write requests to that tablet

What happens when a tablet grows too big?
What happens when a tablet server fails? We need a lock service!

(In HBase, the tablet server is called a Region Server)

SLIDE 54

Bigtable building blocks

GFS
SSTable
Tablet
Tablet Server
Chubby

SLIDE 55

Architecture

Client library
Bigtable master
Tablet servers

SLIDE 56

Bigtable Master

Roles and responsibilities:

Assigns tablets to tablet servers
Detects addition and removal of tablet servers
Balances tablet server load
Handles garbage collection
Handles schema changes

Tablet structure changes:

Table creation/deletion (master initiated)
Tablet merging (master initiated)
Tablet splitting (tablet server initiated)

SLIDE 57

Compactions

Minor compaction

Converts the memtable into an SSTable
Reduces memory usage and log traffic on restart

Merging compaction

Reads a few SSTables and the memtable, and writes out a new SSTable
Reduces number of SSTables

Major compaction

Merging compaction that results in only one SSTable
No deletion records, only live data
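The compaction types above can be sketched with dicts standing in for SSTables; deletion records (tombstones) are modeled as None, and only a major compaction drops them, which is safe only once every older SSTable has been folded in.

```python
TOMBSTONE = None   # stand-in for a deletion record

def merging_compaction(runs, major=False):
    # runs are ordered oldest to newest; later values overwrite earlier
    merged = {}
    for run in runs:
        merged.update(run)
    if major:
        # major compaction: drop tombstones, keep only live data
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return dict(sorted(merged.items()))

old = {"a": 1, "b": 2, "c": 3}
new = {"b": 20, "c": TOMBSTONE}                     # c was deleted
merged = merging_compaction([old, new])             # keeps the tombstone
major = merging_compaction([old, new], major=True)  # drops it
```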

SLIDE 58

Table

Comprised of multiple tablets

SSTables can be shared between tablets

[Diagram: two Tablets, covering rows aardvark to base and basic to database, sharing an SSTable]

SLIDE 59

Three Core Ideas

Partitioning (sharding)

To increase scalability and to decrease latency

Caching

To reduce latency

Replication

To increase robustness (availability) and to increase throughput

SLIDE 60

HBase

Image Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

SLIDE 61

Source: Wikipedia (Japanese rock garden)