SLIDE 1

File Systems and Storage

Marco Serafini

COMPSCI 532 Lecture 14


SLIDE 3

Why GFS?

  • Store “the web” and other very large datasets
  • Peculiar requirements
      • Huge files
      • Files can span multiple servers
      • Coarse-granularity blocks to keep metadata manageable
  • Failures
      • Many servers → many failures
  • Workload
      • Concurrent append-only writes, reads mostly sequential
      • Q: Why is this workload common in a search engine?
SLIDE 4

Design Choices

  • Focus on analytics
      • Optimized for bandwidth, not latency
  • Weak consistency
      • Supports multiple concurrent appends to a file
      • Best-effort attempt to guarantee atomicity of each append
      • Minimal attempts to “fix” state after failures
      • No locks
  • How to deal with weak consistency?
      • Application-level mechanisms to deal with inconsistent data
      • Clients cache only metadata
SLIDE 5

Implementation

  • Distributed layer on top of Linux servers
  • Use local Linux file system to actually store data
SLIDE 6

Master-Slave Architecture

  • Master
      • Keeps file and chunk metadata (e.g. mapping of chunks to chunkservers)
      • Failure detection of chunkservers
  • Procedure (see the sketch below)
      • Client contacts the master to get metadata (small size)
      • Client contacts chunkserver(s) to get data (large size)
      • Master is not a bottleneck
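A minimal sketch of this two-step read procedure. The `master.lookup` and `replica.read_chunk` calls are assumed, illustrative interfaces standing in for the real GFS RPCs; the 64 MB chunk size matches GFS.

```python
CHUNK_SIZE = 64 * 1024 * 1024             # GFS uses 64 MB chunks

def gfs_read(master, filename, offset, length):
    # Step 1: small metadata request to the master -> which chunk holds this
    # offset, and which chunkservers replicate it.
    chunk_index = offset // CHUNK_SIZE
    chunk_handle, replicas = master.lookup(filename, chunk_index)

    # Step 2: the (large) data is fetched directly from a chunkserver, so the
    # master never sits on the data path and does not become a bottleneck.
    replica = replicas[0]                 # real clients prefer a nearby replica
    return replica.read_chunk(chunk_handle, offset % CHUNK_SIZE, length)
```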
SLIDE 7

Architecture

SLIDE 8

Advantages of Large Chunks

  • Small metadata (rough estimate below)
      • All metadata fits in memory at the master → no bottleneck
      • Clients cache lots of metadata → low load on master
  • Batching when transferring data
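A back-of-the-envelope estimate of why the metadata fits in the master's memory. The 64 MB chunk size is GFS's; the ~64 bytes of metadata per chunk is an approximate, assumed figure.

```python
data_size          = 2**50        # 1 PB of file data
chunk_size         = 64 * 2**20   # 64 MB chunks
metadata_per_chunk = 64           # bytes (handle, version, replica locations) - assumed

num_chunks = data_size // chunk_size            # ~16.8 million chunks
print(num_chunks * metadata_per_chunk / 2**30)  # ~1 GiB of metadata: fits in RAM
```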
SLIDE 9

Master Metadata

  • Persisted data
      • File and chunk namespaces
      • File-to-chunks mapping
      • Operation log
      • Stored externally for fault tolerance
      • Q: Why not simply restart the master from scratch?
          • This is what MapReduce does, after all
  • Non-persisted data: location of chunks
      • Fetched at startup from the chunkservers
      • Updated periodically
SLIDE 10

Operation Log

  • Persists the master's state
  • Memory-mapped file
  • The log is a WAL (write-ahead log); WALs are discussed later in this lecture
  • Trimmed using checkpoints
SLIDE 11

Chunkserver Replication

  • Mutations are sent to all replicas
      • One replica is the primary for a lease (a time interval)
      • Within that lease, the primary totally orders mutations and sends the order to the backups
      • After the old lease expires, the master assigns a new primary
  • Separation of data and control flow
      • Data dissemination to all replicas (data flow)
      • Ordering through the primary (control flow)
SLIDE 12

Replication Protocol

  • Client
      • Finds replicas and primary (1, 2)
      • Disseminates data to the chunkservers (3)
      • Contacts the primary replica for ordering (4)
  • Primary (see the sketch below)
      • Determines the write offset and persists the write to disk
      • Sends the offset to the backups (5)
  • Backups
      • Apply the write and ack back to the primary (6)
  • Primary
      • Acks to the client (7)
  • Q: Quorums?
  • Q: Primary election and recovery?
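A hedged sketch of the ordering step at the primary (steps 4 to 7). The class and its `apply`/`size` methods are assumed stand-ins for the real chunkserver interfaces; the data identified by `data_id` is assumed to have been pushed to every replica's buffer in step 3.

```python
class PrimaryReplica:
    def __init__(self, chunk, backups):
        self.chunk = chunk               # local chunk replica
        self.backups = backups           # the other chunkserver replicas
        self.next_offset = chunk.size()

    def record_append(self, data_id, length):
        """Step (4): order a record append whose data was pushed in step (3)."""
        offset = self.next_offset              # the primary proposes the offset...
        self.next_offset += length
        self.chunk.apply(data_id, offset)      # ...and applies the write locally
        acks = [b.apply(data_id, offset) for b in self.backups]   # step (5)
        if all(acks):                          # step (6): every backup applied it
            return ("ok", offset)              # step (7): ack back to the client
        # No quorum and no rollback: the client simply retries the append,
        # which is what leaves duplicate or inconsistent regions behind.
        return ("retry", None)
```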
SLIDE 13

Weak Consistency

  • In the presence of failures
      • There can be inconsistencies (e.g. a failed backup)
      • The client simply retries the write
  • A successful write (acknowledged back to the client) is
      • Atomic: all data written
      • Consistent: same offset at all replicas
      • This is because the primary proposes a specific offset
  • A file contains
      • Stretches of “good” data from successful writes
      • Stretches of “dirty” data: inconsistent and/or duplicate data
SLIDE 14

Implications for Applications

  • Applications must deal with inconsistency (see the sketch below)
      • Add checksums to data to detect dirty writes
      • Add unique record ids to detect duplication
      • Atomic file renaming after finishing a write (single writer)
  • More difficult to program!
      • But “good enough” for this use case
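A minimal sketch of these application-level defenses, assuming a hypothetical record layout (length, checksum, record id, payload). GFS applications used similar self-validating, self-identifying records, but this exact format is an illustration, not theirs.

```python
import struct, zlib

def write_record(f, record_id, payload):
    """Append one self-validating record to a file opened in binary append mode."""
    body = struct.pack(">Q", record_id) + payload
    f.write(struct.pack(">II", len(body), zlib.crc32(body)) + body)

def read_records(f):
    """Yield (record_id, payload), skipping dirty regions and duplicates."""
    seen = set()
    while True:
        header = f.read(8)
        if len(header) < 8:
            return
        length, checksum = struct.unpack(">II", header)
        body = f.read(length)
        if len(body) < length or zlib.crc32(body) != checksum:
            continue     # dirty write: skip it (real readers resynchronize on a marker)
        record_id = struct.unpack(">Q", body[:8])[0]
        if record_id not in seen:            # duplicates come from retried appends
            seen.add(record_id)
            yield record_id, body[8:]
```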
SLIDE 15

Other Semantics Beyond FSs

  • Object store (e.g. AWS S3)
      • Originally conceived for web objects
      • Write-once objects
      • Offset reads
      • Often offers data replication
  • Block store (e.g. AWS EBS)
      • Mounted locally like a remote volume
      • Typically accessed using a file system
      • Not replicated
SLIDE 16

Data Structures for Storage

SLIDE 17

Storing Tables

  • How good are B+ trees?
      • Q: Are they good for reading? Why?
      • Q: Are they good for writing? Why?
SLIDE 18

Log Structured Merge Trees

  • Popular data structure for key-value stores
      • Bigtable, HBase, RocksDB, LevelDB
  • Goals
      • Fast data ingestion
      • Leverage large memory for caching
  • Problems
      • Write and read amplification
SLIDE 19

LSMT Data Structures

  • Memtable (see the sketch below)
      • Binary tree or skiplist → sorted by key
      • Receives writes and serves reads
      • Persistence through a write-ahead log (WAL)
  • Log files (runs) arranged over multiple levels
      • L0: dump of the memtable
      • Li: merge of multiple Li-1 runs
  • Goal: make disk accesses sequential
      • Writes are sequential
      • Merges of sorted data are sequential
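A minimal memtable sketch under assumed names and file formats: a sorted in-memory structure plus a WAL, flushed to an L0 run with one sequential write. Production LSMTs use skiplists and compact binary run formats instead.

```python
import bisect, json

class Memtable:
    def __init__(self, wal_path):
        self.keys, self.values = [], []      # kept sorted by key
        self.wal = open(wal_path, "a")       # write-ahead log for durability

    def put(self, key, value):
        self.wal.write(json.dumps(["put", key, value]) + "\n")
        self.wal.flush()                     # persist the update before applying it
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None                          # caller falls back to the on-disk runs

    def flush(self, run_path):
        """Dump the sorted contents as an L0 run: a single sequential write."""
        with open(run_path, "w") as run:
            for k, v in zip(self.keys, self.values):
                run.write(json.dumps([k, v]) + "\n")
        self.keys, self.values = [], []
```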
SLIDE 20

Write Operations

  • Store updates instead of modifying in place
      • New writes go to the memtable
      • Periodically write the memtable to L0 in sorted key order
  • When level Li becomes too large, merge its runs (see the sketch below)
      • Take two Li runs and merge them (sequential)
      • Create a new Li+1 run
      • Iterate if needed (Li+1 full)
  • Runs at each level store overlapping keys
  • Each level has fewer and larger runs
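A sketch of merging two sorted runs into a single run at the next level, assuming the line-per-record JSON run format of the Memtable sketch above; on duplicate keys the newer run's value wins.

```python
import heapq, json, os

def read_run(path):
    with open(path) as f:
        for line in f:
            yield tuple(json.loads(line))              # sorted (key, value) pairs

def merge_runs(newer_run, older_run, out_path):
    merged = heapq.merge(
        ((k, 0, v) for k, v in read_run(newer_run)),   # 0 sorts before 1, so the
        ((k, 1, v) for k, v in read_run(older_run)))   # newer version comes first
    last_key = None
    with open(out_path, "w") as out:                   # one sequential output write
        for key, _, value in merged:
            if key != last_key:                        # keep only the newest version
                out.write(json.dumps([key, value]) + "\n")
                last_key = key
    os.remove(newer_run)
    os.remove(older_run)                               # the inputs are replaced by out_path
```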
SLIDE 21

Read Operations

  • Search the memtable and read caches (if available)
  • If not found, search the runs level by level (see the sketch below)
      • Bloom filters and indices in each run
      • Binary search in each run or index
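A sketch of this read path under the same assumed run format; the per-run key set stands in for a Bloom filter, and the dictionary lookup stands in for a binary search over the sorted run.

```python
import json

def load_run(path):
    with open(path) as f:
        pairs = [tuple(json.loads(line)) for line in f]
    return {k for k, _ in pairs}, dict(pairs)   # (filter stand-in, run contents)

def lsm_get(memtable, levels, key):
    """`levels` is a list of lists of run paths, newest level and run first."""
    value = memtable.get(key)                   # 1. memtable first
    if value is not None:
        return value
    for level in levels:                        # 2. then level by level
        for run_path in level:
            maybe_present, data = load_run(run_path)
            if key not in maybe_present:        # Bloom filter: definitely absent
                continue
            if key in data:                     # real runs binary-search an index
                return data[key]
    return None
```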
SLIDE 22

Leveled LSMTs (e.g. RocksDB)

  • Differences from the standard LSMT
      • Fixed number of runs per level, increasing for lower levels
      • From L1 downwards, every run stores a partition of the key space
  • Goals
      • Split the cost of merging
      • Reads only need to access one run per level
  • New merge process (see the sketch below)
      • Take two Li runs and merge them with the relevant Li+1 runs
      • Create a new Li+1 run to replace the merged one
      • If the new run is too large, split it and create another Li+1 run
      • Iterate if needed (Li+1 full)
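A small sketch of the key-range selection behind this merge process: because runs below L0 are non-overlapping partitions, a compaction only rewrites the next-level runs whose ranges intersect the runs being pushed down. The function name and tuple layout are assumptions.

```python
def overlapping_runs(run, next_level_runs):
    """Pick the runs at the next level that a leveled compaction must rewrite.

    `run` and each entry of `next_level_runs` are (min_key, max_key, path)
    tuples. Only the intersecting runs are merged and replaced, which is how
    leveled compaction splits the cost of merging across many small steps."""
    lo, hi, _ = run
    return [r for r in next_level_runs if not (r[1] < lo or r[0] > hi)]
```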
SLIDE 23

Providing Durability

SLIDE 24

Write Ahead Log

  • Goals
      • Atomicity: transactions are all or nothing
      • Durability (persistence): completed transactions are not lost
  • Principle (see the sketch below)
      • Append modifications to a log on disk
      • Then apply them
  • After a crash
      • Can redo transactions that committed
      • Can undo transactions that did not commit
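A minimal write-ahead-log sketch for a key-value store where each put is its own transaction; the file format and class name are assumptions, not any specific system's WAL.

```python
import json, os

class KVStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._recover()

    def put(self, key, value):
        # 1. Append the modification to the log and force it to disk...
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps(["put", key, value]) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. ...and only then apply it to the store's state.
        self.data[key] = value

    def _recover(self):
        """Redo logged operations after a crash; unlogged ones never happened."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                try:
                    op, key, value = json.loads(line)
                except ValueError:
                    break        # torn write at the tail of the log: stop here
                if op == "put":
                    self.data[key] = value
```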
SLIDE 25

Example: WAL in LSMTs

  • Transactions
      • Create/Read/Update/Delete (CRUD) on a key-value pair
      • Append the CUD operations to the WAL (reads need no logging)
  • Trimming the WAL?
      • Execute a checkpoint
      • All operations reflected in the checkpoint are removed from the WAL
  • Recovery? (see the sketch below)
      • Read the checkpoint, then re-execute the operations remaining in the WAL
  • ARIES: WAL in DBMSs (more complex than this)
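A sketch of checkpoint-based trimming and recovery, continuing the KVStore sketch above; the single-file checkpoint layout is an assumption and far simpler than ARIES.

```python
import json, os

def checkpoint(store, ckpt_path):
    tmp = ckpt_path + ".tmp"
    with open(tmp, "w") as f:           # write the full state to a new file...
        json.dump(store.data, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, ckpt_path)          # ...and install it atomically
    open(store.wal_path, "w").close()   # everything is in the checkpoint: trim the WAL

def recover(ckpt_path, wal_path):
    data = {}
    if os.path.exists(ckpt_path):       # 1. start from the last checkpoint
        with open(ckpt_path) as f:
            data = json.load(f)
    if os.path.exists(wal_path):        # 2. re-execute operations logged after it
        with open(wal_path) as wal:
            for line in wal:
                try:
                    op, key, value = json.loads(line)
                except ValueError:
                    break
                if op == "put":
                    data[key] = value
                elif op == "delete":
                    data.pop(key, None)
    return data
```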