SLIDE 1

File Systems and Storage

Marco Serafini

COMPSCI 532 Lecture 14


SLIDE 3

Why GFS?

  • Store “the web” and other very large datasets
  • Peculiar requirements
      • Huge files
      • Files can span multiple servers
      • Coarse-granularity blocks to keep metadata manageable
  • Failures
      • Many servers → many failures
  • Workload
      • Concurrent append-only writes, reads mostly sequential
      • Q: Why is this workload common in a search engine?
SLIDE 4

Design Choices

  • Focus on analytics
      • Optimized for bandwidth, not latency
  • Weak consistency
      • Supports multiple concurrent appends to a file
      • Best-effort attempt to guarantee atomicity of each append
      • Minimal attempts to “fix” state after failures
      • No locks
  • How to deal with weak consistency?
      • Application-level mechanisms to deal with inconsistent data
      • Clients cache only metadata
SLIDE 5

Implementation

  • Distributed layer on top of Linux servers
  • Use local Linux file system to actually store data
SLIDE 6

Master-Slave Architecture

  • Master
      • Keeps file and chunk metadata (e.g. mapping of chunks to chunkservers)
      • Failure detection of chunkservers
  • Procedure (see the sketch below)
      • Client contacts the master to get metadata (small size)
      • Client contacts chunkserver(s) to get data (large size)
      • Master is not a bottleneck
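A minimal sketch of this two-step read procedure. The `master.lookup` and `replica.read_chunk` calls are assumed, illustrative interfaces standing in for the real GFS RPCs; the 64 MB chunk size matches GFS.

```python
CHUNK_SIZE = 64 * 1024 * 1024             # GFS uses 64 MB chunks

def gfs_read(master, filename, offset, length):
    # Step 1: small metadata request to the master -> which chunk holds this
    # offset, and which chunkservers replicate it.
    chunk_index = offset // CHUNK_SIZE
    chunk_handle, replicas = master.lookup(filename, chunk_index)

    # Step 2: the (large) data is fetched directly from a chunkserver, so the
    # master never sits on the data path and does not become a bottleneck.
    replica = replicas[0]                 # real clients prefer a nearby replica
    return replica.read_chunk(chunk_handle, offset % CHUNK_SIZE, length)
```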
SLIDE 7

Architecture

SLIDE 8

Advantages of Large Chunks

  • Small metadata (rough estimate below)
      • All metadata fits in memory at the master → no bottleneck
      • Clients cache lots of metadata → low load on master
  • Batching when transferring data
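A back-of-the-envelope estimate of why the metadata fits in the master's memory. The 64 MB chunk size is GFS's; the ~64 bytes of metadata per chunk is an approximate, assumed figure.

```python
data_size          = 2**50        # 1 PB of file data
chunk_size         = 64 * 2**20   # 64 MB chunks
metadata_per_chunk = 64           # bytes (handle, version, replica locations) - assumed

num_chunks = data_size // chunk_size            # ~16.8 million chunks
print(num_chunks * metadata_per_chunk / 2**30)  # ~1 GiB of metadata: fits in RAM
```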
SLIDE 9

Master Metadata

  • Persisted data
      • File and chunk namespaces
      • File-to-chunks mapping
      • Operation log
      • Stored externally for fault tolerance
      • Q: Why not simply restart the master from scratch?
          • This is what MapReduce does, after all
  • Non-persisted data: location of chunks
      • Fetched at startup from the chunkservers
      • Updated periodically
SLIDE 10

Operation Log

  • Persists the master's state
  • Memory-mapped file
  • The log is a WAL (write-ahead log); WALs are discussed later in this lecture
  • Trimmed using checkpoints
SLIDE 11

Chunkserver Replication

  • Mutations are sent to all replicas
      • One replica is the primary for a lease (a time interval)
      • Within that lease, the primary totally orders mutations and sends the order to the backups
      • After the old lease expires, the master assigns a new primary
  • Separation of data and control flow
      • Data dissemination to all replicas (data flow)
      • Ordering through the primary (control flow)
SLIDE 12

Replication Protocol

  • Client
      • Finds replicas and primary (1, 2)
      • Disseminates data to the chunkservers (3)
      • Contacts the primary replica for ordering (4)
  • Primary (see the sketch below)
      • Determines the write offset and persists the write to disk
      • Sends the offset to the backups (5)
  • Backups
      • Apply the write and ack back to the primary (6)
  • Primary
      • Acks to the client (7)
  • Q: Quorums?
  • Q: Primary election and recovery?
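A hedged sketch of the ordering step at the primary (steps 4 to 7). The class and its `apply`/`size` methods are assumed stand-ins for the real chunkserver interfaces; the data identified by `data_id` is assumed to have been pushed to every replica's buffer in step 3.

```python
class PrimaryReplica:
    def __init__(self, chunk, backups):
        self.chunk = chunk               # local chunk replica
        self.backups = backups           # the other chunkserver replicas
        self.next_offset = chunk.size()

    def record_append(self, data_id, length):
        """Step (4): order a record append whose data was pushed in step (3)."""
        offset = self.next_offset              # the primary proposes the offset...
        self.next_offset += length
        self.chunk.apply(data_id, offset)      # ...and applies the write locally
        acks = [b.apply(data_id, offset) for b in self.backups]   # step (5)
        if all(acks):                          # step (6): every backup applied it
            return ("ok", offset)              # step (7): ack back to the client
        # No quorum and no rollback: the client simply retries the append,
        # which is what leaves duplicate or inconsistent regions behind.
        return ("retry", None)
```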
SLIDE 13

Weak Consistency

  • In the presence of failures
      • There can be inconsistencies (e.g. a failed backup)
      • The client simply retries the write
  • A successful write (acknowledged back to the client) is
      • Atomic: all data written
      • Consistent: same offset at all replicas
      • This is because the primary proposes a specific offset
  • A file contains
      • Stretches of “good” data from successful writes
      • Stretches of “dirty” data: inconsistent and/or duplicate data
SLIDE 14

Implications for Applications

  • Applications must deal with inconsistency (see the sketch below)
      • Add checksums to data to detect dirty writes
      • Add unique record ids to detect duplication
      • Atomic file renaming after finishing a write (single writer)
  • More difficult to program!
      • But “good enough” for this use case
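A minimal sketch of these application-level defenses, assuming a hypothetical record layout (length, checksum, record id, payload). GFS applications used similar self-validating, self-identifying records, but this exact format is an illustration, not theirs.

```python
import struct, zlib

def write_record(f, record_id, payload):
    """Append one self-validating record to a file opened in binary append mode."""
    body = struct.pack(">Q", record_id) + payload
    f.write(struct.pack(">II", len(body), zlib.crc32(body)) + body)

def read_records(f):
    """Yield (record_id, payload), skipping dirty regions and duplicates."""
    seen = set()
    while True:
        header = f.read(8)
        if len(header) < 8:
            return
        length, checksum = struct.unpack(">II", header)
        body = f.read(length)
        if len(body) < length or zlib.crc32(body) != checksum:
            continue     # dirty write: skip it (real readers resynchronize on a marker)
        record_id = struct.unpack(">Q", body[:8])[0]
        if record_id not in seen:            # duplicates come from retried appends
            seen.add(record_id)
            yield record_id, body[8:]
```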
SLIDE 15

Other Semantics Beyond FSs

  • Object store (e.g. AWS S3)
      • Originally conceived for web objects
      • Write-once objects
      • Offset reads
      • Often offers data replication
  • Block store (e.g. AWS EBS)
      • Mounted locally like a remote volume
      • Typically accessed using a file system
      • Not replicated
SLIDE 16

Data Structures for Storage

SLIDE 17

Storing Tables

  • How good are B+ trees?
      • Q: Are they good for reading? Why?
      • Q: Are they good for writing? Why?
SLIDE 18

Log Structured Merge Trees

  • Popular data structure for key-value stores
      • Bigtable, HBase, RocksDB, LevelDB
  • Goals
      • Fast data ingestion
      • Leverage large memory for caching
  • Problems
      • Write and read amplification
SLIDE 19

LSMT Data Structures

  • Memtable (see the sketch below)
      • Binary tree or skiplist → sorted by key
      • Receives writes and serves reads
      • Persistence through a write-ahead log (WAL)
  • Log files (runs) arranged over multiple levels
      • L0: dump of the memtable
      • Li: merge of multiple Li-1 runs
  • Goal: make disk accesses sequential
      • Writes are sequential
      • Merges of sorted data are sequential
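A minimal memtable sketch under assumed names and file formats: a sorted in-memory structure plus a WAL, flushed to an L0 run with one sequential write. Production LSMTs use skiplists and compact binary run formats instead.

```python
import bisect, json

class Memtable:
    def __init__(self, wal_path):
        self.keys, self.values = [], []      # kept sorted by key
        self.wal = open(wal_path, "a")       # write-ahead log for durability

    def put(self, key, value):
        self.wal.write(json.dumps(["put", key, value]) + "\n")
        self.wal.flush()                     # persist the update before applying it
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None                          # caller falls back to the on-disk runs

    def flush(self, run_path):
        """Dump the sorted contents as an L0 run: a single sequential write."""
        with open(run_path, "w") as run:
            for k, v in zip(self.keys, self.values):
                run.write(json.dumps([k, v]) + "\n")
        self.keys, self.values = [], []
```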
SLIDE 20

Write Operations

  • Store updates instead of modifying in place
      • New writes go to the memtable
      • Periodically write the memtable to L0 in sorted key order
  • When level Li becomes too large, merge its runs (see the sketch below)
      • Take two Li runs and merge them (sequential)
      • Create a new Li+1 run
      • Iterate if needed (Li+1 full)
  • Runs at each level store overlapping keys
  • Each level has fewer and larger runs
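A sketch of merging two sorted runs into a single run at the next level, assuming the line-per-record JSON run format of the Memtable sketch above; on duplicate keys the newer run's value wins.

```python
import heapq, json, os

def read_run(path):
    with open(path) as f:
        for line in f:
            yield tuple(json.loads(line))              # sorted (key, value) pairs

def merge_runs(newer_run, older_run, out_path):
    merged = heapq.merge(
        ((k, 0, v) for k, v in read_run(newer_run)),   # 0 sorts before 1, so the
        ((k, 1, v) for k, v in read_run(older_run)))   # newer version comes first
    last_key = None
    with open(out_path, "w") as out:                   # one sequential output write
        for key, _, value in merged:
            if key != last_key:                        # keep only the newest version
                out.write(json.dumps([key, value]) + "\n")
                last_key = key
    os.remove(newer_run)
    os.remove(older_run)                               # the inputs are replaced by out_path
```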
SLIDE 21

Read Operations

  • Search the memtable and read caches (if available)
  • If not found, search the runs level by level (see the sketch below)
      • Bloom filters and indices in each run
      • Binary search in each run or index
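A sketch of this read path under the same assumed run format; the per-run key set stands in for a Bloom filter, and the dictionary lookup stands in for a binary search over the sorted run.

```python
import json

def load_run(path):
    with open(path) as f:
        pairs = [tuple(json.loads(line)) for line in f]
    return {k for k, _ in pairs}, dict(pairs)   # (filter stand-in, run contents)

def lsm_get(memtable, levels, key):
    """`levels` is a list of lists of run paths, newest level and run first."""
    value = memtable.get(key)                   # 1. memtable first
    if value is not None:
        return value
    for level in levels:                        # 2. then level by level
        for run_path in level:
            maybe_present, data = load_run(run_path)
            if key not in maybe_present:        # Bloom filter: definitely absent
                continue
            if key in data:                     # real runs binary-search an index
                return data[key]
    return None
```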
SLIDE 22

Leveled LSMTs (e.g. RocksDB)

  • Differences from the standard LSMT
      • Fixed number of runs per level, increasing for lower levels
      • From L1 downwards, every run stores a partition of the key space
  • Goals
      • Split the cost of merging
      • Reads only need to access one run per level
  • New merge process (see the sketch below)
      • Take two Li runs and merge them with the relevant Li+1 runs
      • Create a new Li+1 run to replace the merged one
      • If the new run is too large, split it and create another Li+1 run
      • Iterate if needed (Li+1 full)
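A small sketch of the key-range selection behind this merge process: because runs below L0 are non-overlapping partitions, a compaction only rewrites the next-level runs whose ranges intersect the runs being pushed down. The function name and tuple layout are assumptions.

```python
def overlapping_runs(run, next_level_runs):
    """Pick the runs at the next level that a leveled compaction must rewrite.

    `run` and each entry of `next_level_runs` are (min_key, max_key, path)
    tuples. Only the intersecting runs are merged and replaced, which is how
    leveled compaction splits the cost of merging across many small steps."""
    lo, hi, _ = run
    return [r for r in next_level_runs if not (r[1] < lo or r[0] > hi)]
```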
SLIDE 23

Providing Durability

SLIDE 24

Write Ahead Log

  • Goals
      • Atomicity: transactions are all or nothing
      • Durability (persistence): completed transactions are not lost
  • Principle (see the sketch below)
      • Append modifications to a log on disk
      • Then apply them
  • After a crash
      • Can redo transactions that committed
      • Can undo transactions that did not commit
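A minimal write-ahead-log sketch for a key-value store where each put is its own transaction; the file format and class name are assumptions, not any specific system's WAL.

```python
import json, os

class KVStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._recover()

    def put(self, key, value):
        # 1. Append the modification to the log and force it to disk...
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps(["put", key, value]) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. ...and only then apply it to the store's state.
        self.data[key] = value

    def _recover(self):
        """Redo logged operations after a crash; unlogged ones never happened."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                try:
                    op, key, value = json.loads(line)
                except ValueError:
                    break        # torn write at the tail of the log: stop here
                if op == "put":
                    self.data[key] = value
```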
SLIDE 25

Example: WAL in LSMTs

  • Transactions
      • Create/Read/Update/Delete (CRUD) on a key-value pair
      • Append the CUD operations to the WAL (reads need no logging)
  • Trimming the WAL?
      • Execute a checkpoint
      • All operations reflected in the checkpoint are removed from the WAL
  • Recovery? (see the sketch below)
      • Read the checkpoint, then re-execute the operations remaining in the WAL
  • ARIES: WAL in DBMSs (more complex than this)
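A sketch of checkpoint-based trimming and recovery, continuing the KVStore sketch above; the single-file checkpoint layout is an assumption and far simpler than ARIES.

```python
import json, os

def checkpoint(store, ckpt_path):
    tmp = ckpt_path + ".tmp"
    with open(tmp, "w") as f:           # write the full state to a new file...
        json.dump(store.data, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, ckpt_path)          # ...and install it atomically
    open(store.wal_path, "w").close()   # everything is in the checkpoint: trim the WAL

def recover(ckpt_path, wal_path):
    data = {}
    if os.path.exists(ckpt_path):       # 1. start from the last checkpoint
        with open(ckpt_path) as f:
            data = json.load(f)
    if os.path.exists(wal_path):        # 2. re-execute operations logged after it
        with open(wal_path) as wal:
            for line in wal:
                try:
                    op, key, value = json.loads(line)
                except ValueError:
                    break
                if op == "put":
                    data[key] = value
                elif op == "delete":
                    data.pop(key, None)
    return data
```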