  1. Asynchronous Logging and Fast Recovery for a Large-Scale Distributed In-Memory Storage Kevin Beineke, Florian Klein, Michael Schöttner Institut für Informatik, Heinrich-Heine-Universität Düsseldorf

  2. Outline • Motivation • The In-Memory Storage DXRAM • Asynchronous Logging • Fast Recovery • Reorganization • Conclusion

3. Motivation 1/13
• Large-scale interactive applications and online graph computations:
  • Billions of small data objects
  • Dynamically expanding
  • Read accesses dominate write accesses
  • Short latency required
• Example: Facebook
  • More than one billion users
  • More than 150 TB of data (2011)
  • 70% of all data objects are smaller than 64 bytes (2011)
• Traditional databases are at their limits

4. Motivation 2/13
• Common approach to meet the discussed requirements: RAM caches
  • Must be synchronized with secondary storage
  • Refilling after a failure is very time-consuming (Facebook outage 2011: 2.5 h)
  • Cache misses are expensive
• Another approach: keeping all objects in RAM at all times
• RAMCloud:
  • Table-based data model
  • 64-bit global ID mapping via hash table
  • Log-structured memory design
  • Optimized for large files

5. The In-Memory Storage DXRAM

6. The In-Memory Storage DXRAM: Overview 3/13
• DXRAM is a distributed in-memory system:
  • Optimized to handle billions of small objects
  • Key-value data model with name service
  • Transparent backup to SSD (or HDD)
• Core services:
  • Management, storage, and transfer of key-value tuples (chunks)
  • Minimal interface
• Extended data services:
  • General services and extended data models

7. The In-Memory Storage DXRAM 4/13: Chunks
• Variable sizes
• Every chunk is initially stored on its creator, but can be migrated (hot spots)
• Every chunk has a 64-bit globally unique chunk ID (CID), see the sketch below:
  • First 16 bits: NodeID (NID) of the creator node
  • Last 48 bits: locally unique sequential ID (LocalID)
• Impact:
  • Locality: chunks created at the same location and adjacent in time have similar CIDs
  • The initial location is encoded in the CID: no lookup needed if the chunk was not migrated
  • After migration, the new location must be stored elsewhere
  • Applications cannot specify their own IDs
• Migrated CIDs are stored in ranges in a B-tree on dedicated nodes
  • No entry -> the chunk is still stored on its creator
• Support for user-defined keys:
  • Name service with a patricia-trie structure
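
A minimal sketch of how such a 64-bit CID could be packed and unpacked; the class and method names are illustrative, not DXRAM's actual API:

```java
public final class ChunkID {

    private static final int LOCAL_ID_BITS = 48;
    private static final long LOCAL_ID_MASK = (1L << LOCAL_ID_BITS) - 1;

    // Combine a 16-bit NodeID and a 48-bit LocalID into one CID.
    public static long create(short nid, long localId) {
        return ((nid & 0xFFFFL) << LOCAL_ID_BITS) | (localId & LOCAL_ID_MASK);
    }

    // Extract the creator's NodeID: the initial location, no lookup needed.
    public static short nodeId(long cid) {
        return (short) (cid >>> LOCAL_ID_BITS);
    }

    // Extract the locally unique sequential ID.
    public static long localId(long cid) {
        return cid & LOCAL_ID_MASK;
    }
}
```

Because the LocalID is assigned sequentially per creator, chunks created close together in time on the same node end up numerically adjacent, which is what gives the locality property named above.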

8. The In-Memory Storage DXRAM 5/13: Global meta-data management
• Fast node lookup with a custom Chord-like super-peer overlay
  • 8 to 10% of all nodes are super-peers
  • Super-peers store no data, only meta-data
  • Meta-data is replicated on successors
  • Every super-peer knows every other super-peer -> lookup with constant time complexity O(1) (sketch below)
  • Every peer is assigned to one super-peer
• Fast node recovery
  • Super-peers also store backup locations
  • Distributed failure detection
  • Super-peer-coordinated recovery with multiple peers
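
A deliberately simplified illustration (not DXRAM's overlay code) of why the lookup is O(1): since every super-peer knows every other super-peer, resolving which super-peer is responsible for a peer is one local table access plus at most one network hop, instead of an O(log n) Chord-style hop sequence:

```java
import java.util.HashMap;
import java.util.Map;

public class SuperPeerDirectory {

    // NID of a peer -> NID of the super-peer responsible for its meta-data.
    private final Map<Short, Short> responsibleSuperPeer = new HashMap<>();

    public void assignPeer(short peerNid, short superPeerNid) {
        responsibleSuperPeer.put(peerNid, superPeerNid);
    }

    // Constant-time resolution: one hash-map access, then at most one
    // network hop to the responsible super-peer.
    public Short lookupSuperPeer(short peerNid) {
        return responsibleSuperPeer.get(peerNid);
    }
}
```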

9. Asynchronous Logging

10. Asynchronous Logging 6/13: SSD utilization
• Characteristics of SSDs:
  • SSDs write at least one page (4 KB); pages are clustered to be accessed in parallel
  • SSDs cannot overwrite a single flash page; instead, they erase a whole block (64 to 128 pages) and write elsewhere
  • Writing sequentially is faster than writing randomly
  • Mixing write and read accesses slows the SSD down
  • Life span: limited number of program-erase cycles
• Consequences (illustrated in the sketch below):
  • Buffer write accesses
  • Use a log to avoid deletions and to write sequentially
  • Only read the log during recovery
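
A minimal sketch of these consequences, assuming entries smaller than one flash page (class names are illustrative, not DXRAM code): writes are buffered and flushed append-only in 4 KB page units, so the SSD sees only sequential, page-aligned writes and no in-place overwrites:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLog {

    private static final int PAGE_SIZE = 4096;

    private final FileChannel channel;
    private final ByteBuffer pageBuffer = ByteBuffer.allocateDirect(PAGE_SIZE);

    public AppendOnlyLog(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Buffer small entries; flush one full page sequentially when the
    // next entry would not fit. Assumes entry.length <= PAGE_SIZE.
    public synchronized void append(byte[] entry) throws IOException {
        if (entry.length > pageBuffer.remaining()) {
            flushPage();
        }
        pageBuffer.put(entry);
    }

    private void flushPage() throws IOException {
        // Pad to the page boundary so the device never rewrites a page.
        while (pageBuffer.hasRemaining()) {
            pageBuffer.put((byte) 0);
        }
        pageBuffer.flip();
        channel.write(pageBuffer);
        pageBuffer.clear();
    }
}
```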

11. Asynchronous Logging 7/13: Architecture
• Two-level log organization: one primary log and one secondary log for every node requesting backups
• Idea: store incoming backup requests on SSD as soon as possible to avoid data loss, while writing as much as possible at once
• No need to store meta-data in RAM, because every entry is self-describing (sketch below)
[Figure: backup requests enter the write buffer; on time-out or threshold they are flushed to the primary log, then sorted by NID into the secondary logs 1..X]
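
A hedged sketch of what "self-describing" entries could look like; the field layout is an assumption for illustration, not DXRAM's actual on-disk format. Each entry carries its own chunk ID and payload length, so the log can be scanned during recovery without any meta-data held in RAM:

```java
import java.nio.ByteBuffer;

public record LogEntry(long cid, byte[] payload) {

    // Assumed on-disk layout: [ 8-byte CID | 4-byte payload length | payload ].
    public ByteBuffer serialize() {
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + payload.length);
        buf.putLong(cid);
        buf.putInt(payload.length);
        buf.put(payload);
        buf.flip();
        return buf;
    }

    // During recovery the log is scanned entry by entry; each header tells
    // the scanner how far to advance, so no in-RAM index is required.
    public static LogEntry readNext(ByteBuffer log) {
        long cid = log.getLong();
        byte[] payload = new byte[log.getInt()];
        log.get(payload);
        return new LogEntry(cid, payload);
    }
}
```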

12. Asynchronous Logging 8/13: Architecture
• Write buffer:
  • Stores chunks from potentially every node; is filled frequently
  • Bundles backup requests (4 KB)
  • Decouples network threads (synchronous replication possible)
  • Parallel access to the write buffer: two-bucket approach, X producers (network threads), 1 consumer (writer thread)
• Writer thread (sketch below):
  • Flushes the write buffer to the primary log after a time-out (e.g. 0.5 s) or if a threshold is reached (e.g. 16 MB)
• Problem: to recover all data of one node, the whole primary log must be processed
[Figure: backup requests fill the write buffer in RAM; the writer thread flushes it to the primary log on SSD]
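
A minimal sketch of this producer-consumer policy, assuming the time-out and threshold values given on the slide (the class is illustrative, not DXRAM code): network threads append requests, and a single writer thread flushes either when the time-out expires or as soon as the threshold is reached:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class WriteBuffer {

    private static final long FLUSH_TIMEOUT_MS = 500;            // e.g. 0.5 s
    private static final int FLUSH_THRESHOLD = 16 * 1024 * 1024; // e.g. 16 MB

    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // Producers: network threads append incoming backup requests.
    public synchronized void append(byte[] backupRequest) {
        buffer.write(backupRequest, 0, backupRequest.length);
        if (buffer.size() >= FLUSH_THRESHOLD) {
            notify(); // wake the writer thread before the time-out
        }
    }

    // Single consumer: the writer thread flushes on time-out or threshold.
    public void writerLoop(FileChannel primaryLog)
            throws IOException, InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            byte[] toFlush;
            synchronized (this) {
                if (buffer.size() < FLUSH_THRESHOLD) {
                    wait(FLUSH_TIMEOUT_MS); // time-out based flush
                }
                toFlush = buffer.toByteArray();    // swap buffers ("two buckets")
                buffer = new ByteArrayOutputStream();
            }
            if (toFlush.length > 0) {
                primaryLog.write(ByteBuffer.wrap(toFlush)); // one large sequential write
            }
        }
    }
}
```

Swapping in a fresh buffer under the lock and writing the old one outside it is one way to realize the two-bucket idea: producers are blocked only for the swap, never for the SSD write.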

13. Asynchronous Logging 9/13: Architecture
[Figure: backup requests fill the write buffer in RAM; on time-out or threshold it is flushed to the primary log on SSD; per-node secondary log buffers (1..X) feed the corresponding secondary logs (1..X) on SSD]

14. Asynchronous Logging 10/13: Optimizations
• The write buffer is sorted by NID before writing to SSD
• If there are more than 4 KB for one node, the data is written directly to the corresponding secondary log (sketch below)
  • Method: combination of hashing and monitoring
• Clearing the primary log:
  • Flush all secondary log buffers
  • Set the read pointer to the write pointer
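
A hedged sketch of this routing decision (names and interfaces are illustrative assumptions): entries are grouped by NID via hashing, and any per-node group larger than one flash page bypasses the primary log:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlushRouter {

    private static final int DIRECT_WRITE_THRESHOLD = 4096; // one flash page

    public record BackupEntry(short nid, byte[] data) {}

    public interface LogWriter {
        void append(byte[] data);
    }

    public void flush(List<BackupEntry> writeBuffer,
                      LogWriter primaryLog,
                      Map<Short, LogWriter> secondaryLogs) {
        // Group ("sort") the buffered entries by NodeID via hashing.
        Map<Short, List<BackupEntry>> byNid = new HashMap<>();
        for (BackupEntry e : writeBuffer) {
            byNid.computeIfAbsent(e.nid(), k -> new ArrayList<>()).add(e);
        }
        for (Map.Entry<Short, List<BackupEntry>> group : byNid.entrySet()) {
            int bytes = 0;
            for (BackupEntry e : group.getValue()) {
                bytes += e.data().length;
            }
            // More than one page for a node: skip the primary log and
            // write directly to that node's secondary log.
            LogWriter target = bytes > DIRECT_WRITE_THRESHOLD
                    ? secondaryLogs.get(group.getKey())
                    : primaryLog;
            for (BackupEntry e : group.getValue()) {
                target.append(e.data());
            }
        }
    }
}
```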

15. Fast Recovery

16. Fast Recovery 11/13
• Super-peer overlay:
  • Fast and distributed failure detection (hierarchical heartbeat protocol)
  • Coordinated and targeted peer recovery (the super-peer knows all corresponding backup locations)
• Recovery modes:
  1. Every contacted peer recovers chunks locally (fastest, no data transfer)
  2. All chunks are recovered and sent to one peer (1:1)
  3. All chunks are recovered and sent to several peers (faster, but less locality; used by RAMCloud)
  4. Modes 1 and 2 combined: recover locally and rebuild the failed peer gradually

17. Reorganization

18. Reorganization 12/13
• Write buffers and the primary log are cleared periodically
• Secondary logs are filled contiguously
• To free the space of deleted or outdated entries, the secondary logs have to be reorganized
• Every peer reorganizes its logs independently
• Demands:
  • Space efficiency
  • As little disruption as possible
  • Incremental operation to guarantee fast recovery
• Idea (inspired by LFS, the log-structured file system):
  • Divide the log into segments of fixed size
  • Reorganize one segment after another
  • Distinguish segments by access frequency (hot and cold zones)
  • Decide which segment to reorganize by a cost-benefit ratio (classic formula sketched below)
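
The slides do not give the formula; the conclusion states that DXRAM plans an *adapted* cost-benefit formula, so the classic LFS ratio (Rosenblum/Ousterhout) is only the well-known starting point. It reads benefit/cost = (1 - u) * age / (1 + u), where u is the fraction of live data in a segment: reading the segment costs 1, writing back live data costs u, and the freed space is 1 - u, so old, mostly dead segments are cleaned first. A minimal sketch:

```java
public class SegmentCleaner {

    // Classic LFS cost-benefit ratio: (1 - u) * age / (1 + u).
    public static double costBenefit(double utilization, double age) {
        return (1.0 - utilization) * age / (1.0 + utilization);
    }

    // Pick the segment with the highest ratio to reorganize next;
    // utilization[i] and age[i] describe segment i.
    public static int selectSegment(double[] utilization, double[] age) {
        int best = 0;
        for (int i = 1; i < utilization.length; i++) {
            if (costBenefit(utilization[i], age[i])
                    > costBenefit(utilization[best], age[best])) {
                best = i;
            }
        }
        return best;
    }
}
```

Weighting by age is what separates hot and cold zones: a cold segment with moderate utilization can still beat a hot, nearly empty one, because its live data is unlikely to be invalidated soon after being copied.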

19. Conclusion 13/13
• Current status:
  • DXRAM memory management tested on a cluster with more than 5 billion objects
  • Small-object processing faster than RAMCloud
  • Multithreaded buffer implemented and examined under a worst-case scenario
  • Logs fully functional with a less complex reorganization scheme
  • Node failure detection and initialization of the recovery process tested
• Outlook:
  • Implementation of an LFS-like reorganization scheme with an adapted cost-benefit formula
  • Replica placement (Copysets)
  • Evaluation of the complete recovery process

20. Backup Slides

21. The In-Memory Storage DXRAM 14/13: In-memory data management
• Paging-like translation to local addresses instead of a hash table (sketch below)
  • Space-efficient and fast
  • Minimized internal fragmentation
  • Small overhead: only 7 bytes for chunks smaller than 256 bytes
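
A hedged sketch of a paging-like translation; the level split and table sizes are illustrative assumptions, not DXRAM's actual layout. The 48-bit LocalID is divided into four 12-bit indexes into a tree of tables, like a multi-level CPU page table, instead of hashing the full 64-bit CID; tables are allocated lazily, which suits densely packed sequential LocalIDs:

```java
public class PagingTranslation {

    private static final int BITS = 12;         // assumption: 4 levels x 12 bits = 48
    private static final int SIZE = 1 << BITS;
    private static final int MASK = SIZE - 1;

    private final Object[] root = new Object[SIZE];

    // Store the local memory address of a chunk under its LocalID.
    public void put(long localId, long address) {
        Object[] l1 = child(root, (int) ((localId >>> 36) & MASK));
        Object[] l2 = child(l1, (int) ((localId >>> 24) & MASK));
        int leafIdx = (int) ((localId >>> 12) & MASK);
        if (l2[leafIdx] == null) {
            l2[leafIdx] = new long[SIZE];       // leaf level holds raw addresses
        }
        ((long[]) l2[leafIdx])[(int) (localId & MASK)] = address;
    }

    // Resolve a LocalID to its address; assumes the ID was inserted before.
    public long get(long localId) {
        Object[] l1 = (Object[]) root[(int) ((localId >>> 36) & MASK)];
        Object[] l2 = (Object[]) l1[(int) ((localId >>> 24) & MASK)];
        long[] leaf = (long[]) l2[(int) ((localId >>> 12) & MASK)];
        return leaf[(int) (localId & MASK)];
    }

    private static Object[] child(Object[] table, int idx) {
        if (table[idx] == null) {
            table[idx] = new Object[SIZE];
        }
        return (Object[]) table[idx];
    }
}
```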
