Asynchronous Logging and Fast Recovery for a Large-Scale Distributed In-Memory Storage
Kevin Beineke, Florian Klein, Michael Schöttner
Institut für Informatik, Heinrich-Heine-Universität Düsseldorf
Outline
- Motivation
- The In-Memory Storage DXRAM
- Asynchronous Logging
- Fast Recovery
- Reorganization
- Conclusion
Motivation
- Large-scale interactive applications and online graph computations:
- Billions of small data objects
- Dynamically expanding
- Read accesses dominate over write accesses
- Short latency required
- Example: Facebook
- More than one billion users
- More than 150 TB of data (2011)
- 70% of all data objects are smaller than 64 bytes (2011)
Traditional databases are at their limits
Motivation
- Common approach to meet discussed requirements: RAM-Caches
- Must be synchronized with secondary storage
- Refilling the cache after a failure is very time-consuming (Facebook outage 2011 -> 2.5 h)
- Cache misses are expensive
- Another approach: Keeping all objects in RAM at all times
- RAMCloud:
- Table-based data model
- 64-bit global ID mapping via a hash table
- Log-structured memory design
- Optimized for large files
The In-Memory Storage DXRAM
The In-Memory Storage DXRAM
Overview
- DXRAM is a distributed in-memory system:
- Optimized to handle billions of small objects
- Key-value data model with name service
- Transparent backup to SSD (or HDD)
- Core Services:
- For management, storage and transfer of key-value tuples (chunks)
- Minimal interface
- Extended Data Services:
- General services and extended data models
The In-Memory Storage DXRAM
Chunks
- Variable sizes
- Every chunk is initially stored on its creator, but can be migrated (e.g., to resolve hot spots)
- Every chunk has a 64-bit globally unique chunk ID (CID), composed as shown in the sketch after this list
- First 16 bits: NodeID of the creator node
- Last 48 bits: Locally unique sequential ID
- Impact:
- Locality: Chunks created at the same location and adjacent in time have similar CIDs
- Initial location is stored in CID:
- No lookup needed if a chunk was not migrated
- After migration: New location must be stored elsewhere
- Applications cannot specify own IDs
- Migrated CIDs are stored in ranges in a B-tree on dedicated nodes
- No entry -> chunk is still stored on creator
- Support for user-defined keys:
- Name service with a Patricia-trie structure
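A minimal sketch of the CID composition described above, using plain Java bit operations (an illustration only, not DXRAM's actual code):

```java
// Hedged sketch: pack a 16-bit NodeID and a 48-bit locally unique sequential ID
// into one 64-bit chunk ID (CID), and extract both parts again.
public final class ChunkID {
    private static final int LOCAL_ID_BITS = 48;
    private static final long LOCAL_ID_MASK = (1L << LOCAL_ID_BITS) - 1;

    // Upper 16 bits: NodeID of the creator; lower 48 bits: LocalID.
    public static long create(short nodeID, long localID) {
        return (((long) nodeID & 0xFFFF) << LOCAL_ID_BITS) | (localID & LOCAL_ID_MASK);
    }

    public static short nodeID(long cid) {
        return (short) (cid >>> LOCAL_ID_BITS);
    }

    public static long localID(long cid) {
        return cid & LOCAL_ID_MASK;
    }
}
```

Because chunks created on the same node get sequential LocalIDs, their CIDs are numerically adjacent, which is the locality property mentioned above.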
The In-Memory Storage DXRAM
Global meta-data management
- Fast node lookup with a custom Chord-like super-peer overlay
- 8 to 10% of all nodes are super-peers
- Super-peers do not store data, only meta-data
- Meta-data is replicated on successors
- Every super-peer knows every other super-peer -> lookup with constant time complexity, O(1)
- Every peer is assigned to one super-peer
- Fast node recovery
- Super-peers also store backup locations
- Distributed failure detection
- Super-peer coordinated recovery with multiple peers
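As an illustration of the lookup path sketched on the last two slides, the following hedged sketch (hypothetical interfaces, not DXRAM's API; it reuses the ChunkID helper from the earlier sketch) asks the responsible super-peer's migration B-tree and falls back to the creator encoded in the CID if there is no entry:

```java
// Hedged sketch: resolve the node currently storing a chunk.
interface SuperPeer {
    // Range-based B-tree over migrated CIDs; returns -1 if there is no entry.
    short lookupMigratedLocation(long cid);
}

final class ChunkLocator {
    private final SuperPeer responsibleSuperPeer; // the super-peer this peer is assigned to

    ChunkLocator(SuperPeer superPeer) {
        this.responsibleSuperPeer = superPeer;
    }

    short locate(long cid) {
        short migratedTo = responsibleSuperPeer.lookupMigratedLocation(cid);
        if (migratedTo != -1) {
            return migratedTo;          // chunk was migrated, the overlay knows the new owner
        }
        return ChunkID.nodeID(cid);     // no entry -> chunk is still stored on its creator
    }
}
```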
Asynchronous Logging
Asynchronous Logging
SSD Utilization
- Characteristics of SSDs:
- SSDs write at least one page (4 KB); pages are clustered to be accessed in parallel
- SSDs cannot overwrite a single flash page in place; a whole block (64 to 128 pages) is erased and the data is written elsewhere
- It is faster to write sequentially than randomly on SSD
- Mixing write and read accesses slows the SSD down
- Life span: Limited number of program-erase cycles
- Consequences:
- Buffer write accesses
- Use a log to avoid deletions and to write sequentially
- Only read the log during recovery
- Two-level log organization: One primary log and one secondary log for every node requesting backups
- Idea: Store incoming backup requests as soon as possible on SSD to avoid data loss and, at the same time, write as much as possible at once
- No need to store meta-data in RAM, because every entry is self-describing
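A possible shape of such a self-describing entry (the field layout below is an assumption for illustration, not DXRAM's actual on-SSD format):

```java
import java.nio.ByteBuffer;

// Hedged sketch: each log entry carries its own CID, payload length and version,
// so no per-entry meta-data needs to be kept in RAM.
final class LogEntry {
    static void write(ByteBuffer log, long cid, int version, byte[] payload) {
        log.putLong(cid);            // 8 bytes: chunk ID
        log.putInt(payload.length);  // 4 bytes: payload length
        log.putInt(version);         // 4 bytes: version, used to detect outdated entries
        log.put(payload);            // the chunk data itself
    }
}
```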
Asynchronous Logging
Architecture
(Architecture figure: backup requests are gathered in the write buffer; on time-out or threshold they are sorted by NID and written to the primary log and the secondary logs)
Asynchronous Logging
Architecture
(Architecture figure: X producer network threads fill the write buffer in RAM; one consumer writer thread flushes it to the primary log on SSD on time-out or threshold)
Write buffer:
- The write buffer stores chunks from potentially every node and is therefore filled frequently
- Bundles backup requests (4 KB)
- Decouples network threads (sync possible)
- Parallel access to write buffer
Writer thread:
- Flushes the write buffer to the primary log after a time-out (e.g. 0.5 s) or if a threshold is reached (e.g. 16 MB); see the sketch below
- Two-bucket approach
Problem:
- To recover all data of one node, the whole primary log must be processed
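A minimal sketch of this producer-consumer scheme (class and constant names are assumptions, not DXRAM's implementation): many network threads append backup requests, and a single writer thread flushes the buffer when the threshold is reached or the time-out expires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of the write buffer: producers append, one consumer flushes.
final class WriteBuffer {
    private static final int THRESHOLD_BYTES = 16 * 1024 * 1024; // e.g. 16 MB
    private static final long TIMEOUT_MS = 500;                  // e.g. 0.5 s

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition flushNeeded = lock.newCondition();
    private List<byte[]> pending = new ArrayList<>();
    private int pendingBytes = 0;

    // Producer side: called by many network threads.
    void append(byte[] entry) {
        lock.lock();
        try {
            pending.add(entry);
            pendingBytes += entry.length;
            if (pendingBytes >= THRESHOLD_BYTES) {
                flushNeeded.signal(); // wake the writer thread before the time-out
            }
        } finally {
            lock.unlock();
        }
    }

    // Consumer side: the single writer thread loops here.
    void flushLoop(PrimaryLog primaryLog) throws InterruptedException {
        while (true) {
            List<byte[]> batch;
            lock.lock();
            try {
                if (pendingBytes < THRESHOLD_BYTES) {
                    flushNeeded.await(TIMEOUT_MS, TimeUnit.MILLISECONDS); // time-out path
                }
                batch = pending;                  // swap buffers ("two bucket" idea)
                pending = new ArrayList<>();
                pendingBytes = 0;
            } finally {
                lock.unlock();
            }
            if (!batch.isEmpty()) {
                primaryLog.write(batch);          // one large sequential write to SSD
            }
        }
    }

    interface PrimaryLog { void write(List<byte[]> batch); }
}
```

Swapping in a fresh list on every flush reflects the two-bucket idea: producers keep filling one bucket while the writer thread drains the other.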
Asynchronous Logging
Architecture
(Architecture figure: on time-out or threshold, the write buffer in RAM is flushed to the primary log and to per-node secondary log buffers 1..X, which in turn feed secondary logs 1..X on SSD)
Asynchronous Logging
Optimizations
- The write buffer is sorted by NID before writing to SSD
- If there is more than 4 KB of data for one node, it is written directly to the corresponding secondary log; see the sketch below
- Method: Combination of hashing and monitoring
- Clearing the primary log:
- Flush all secondary log buffers
- Set read pointer to write pointer
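A hedged sketch of this optimization (helper types are assumptions, not DXRAM's code): entries are grouped by NodeID, and any group larger than one SSD page bypasses the primary log and goes directly to that node's secondary log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch: sort the write buffer by NID and route large batches directly
// to the owning node's secondary log.
final class NidSorter {
    static final int PAGE_SIZE = 4096;

    static void flush(List<Entry> batch, Logs logs) {
        Map<Short, List<Entry>> byNid = new HashMap<>();
        Map<Short, Integer> bytesPerNid = new HashMap<>();
        for (Entry e : batch) {
            byNid.computeIfAbsent(e.nodeID, k -> new ArrayList<>()).add(e);
            bytesPerNid.merge(e.nodeID, e.data.length, Integer::sum);
        }
        for (Map.Entry<Short, List<Entry>> group : byNid.entrySet()) {
            if (bytesPerNid.get(group.getKey()) > PAGE_SIZE) {
                logs.writeToSecondaryLog(group.getKey(), group.getValue()); // large enough: skip the primary log
            } else {
                logs.writeToPrimaryLog(group.getValue());                   // small: keep in the primary log for now
            }
        }
    }

    static final class Entry { short nodeID; byte[] data; }
    interface Logs {
        void writeToPrimaryLog(List<Entry> entries);
        void writeToSecondaryLog(short nodeID, List<Entry> entries);
    }
}
```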
Fast Recovery
Fast Recovery
- Super-peer overlay:
- Fast and distributed failure detection (hierarchical heart beat protocol)
- Coordinated and purposeful peer recovery (the super-peer knows all corresponding backup locations)
- Recovery modes:
1. Every contacted peer recovers chunks locally (fastest, no data transfer)
2. All chunks are recovered and sent to one peer (1:1)
3. All chunks are recovered and sent to several peers (faster, but less locality; used by RAMCloud)
4. 1 and 2 combined: recover locally and rebuild the failed peer gradually
Reorganization
Reorganization
- Write buffers and primary log are cleared periodically
- Secondary logs are contiguously filled
- To free the space of deleted or outdated entries, the secondary logs have to be reorganized
- Every peer reorganizes its logs independently
- Demands:
- Space-efficiency
- As little disruption as possible
- Incremental operation to guarantee fast recovery
- Idea (inspired by LFS, the log-structured file system):
- Divide log into segments with fixed size
- Reorganize one segment after another
- Distinguish segments by access frequency (hot and cold zones)
- Decide which segment to reorganize by a cost-benefit ratio (see the sketch below)
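For illustration, a hedged sketch of cost-benefit segment selection using the classic LFS formula benefit/cost = (1 - u) * age / (1 + u), where u is the segment's live-data fraction; the outlook slide notes that DXRAM plans an adapted formula, so this shows only the baseline principle.

```java
// Hedged sketch: pick the next segment to reorganize by the classic LFS cost-benefit score.
final class SegmentSelector {
    static final class Segment {
        double utilization;   // u: live bytes / segment size, in [0, 1]
        long lastModified;    // used to derive the segment's age
    }

    static Segment pick(Segment[] segments, long now) {
        Segment best = null;
        double bestScore = -1;
        for (Segment s : segments) {
            double age = now - s.lastModified;                       // older data is "colder"
            double score = (1.0 - s.utilization) * age / (1.0 + s.utilization);
            if (score > bestScore) {
                bestScore = score;
                best = s;                                            // reorganize this segment next
            }
        }
        return best;
    }
}
```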
Conclusion
- Current status:
- DXRAM memory management tested on cluster with more than 5 billion objects
- Small object processing faster than RAMCloud
- Multi-threaded buffer implemented and examined under a worst-case scenario
- Logs fully functional with a less complex reorganization scheme
- Node failure detection and initialization of recovery process tested
- Outlook:
- Implementation of an LFS-like reorganization scheme with an adapted cost-benefit formula
- Replica placement (Copysets)
- Evaluation of complete recovery process
Backup Slides
The In-Memory Storage DXRAM
In-memory data management
- Paging-like translation to local addresses instead of a hash table (see the sketch below)
- Space-efficient and fast
- Minimized internal fragmentation
- Small overhead: only 7 bytes per chunk
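A hedged sketch of what such a paging-like translation could look like (the two-level split and bit widths are assumptions; the slide does not specify DXRAM's actual layout):

```java
// Hedged sketch: translate a LocalID to a local memory address via a page-table-like
// two-level table instead of a hash table. Simplified: two 12-bit levels cover only
// the lower 24 bits of the 48-bit LocalID.
final class AddressTranslation {
    private static final int LEVEL_BITS = 12;                  // assumed: 12 bits per level
    private static final int ENTRIES = 1 << LEVEL_BITS;
    private static final int MASK = ENTRIES - 1;

    private final long[][] directory = new long[ENTRIES][];    // first level; second level allocated lazily

    void put(long localID, long address) {
        int first = (int) (localID >>> LEVEL_BITS) & MASK;
        int second = (int) localID & MASK;
        if (directory[first] == null) {
            directory[first] = new long[ENTRIES];
        }
        directory[first][second] = address;
    }

    long get(long localID) {
        int first = (int) (localID >>> LEVEL_BITS) & MASK;
        long[] table = directory[first];
        return table == null ? 0 : table[(int) localID & MASK];
    }
}
```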