Asynchronous Logging and Fast Recovery for a Large-Scale Distributed In-Memory Storage
Kevin Beineke, Florian Klein, Michael Schöttner
Institut für Informatik, Heinrich-Heine-Universität Düsseldorf
Outline
- Motivation
- The In-Memory Storage DXRAM
- Asynchronous Logging
- Fast Recovery
- Reorganization
- Conclusion
Motivation
- Large-scale interactive applications and online graph computations:
- Billions of small data objects
- Dynamically expanding
- Read accesses dominate over write accesses
- Short latency required
- Example: Facebook
- More than one billion users
- More than 150 TB of data (2011)
- 70% of all data objects are smaller than 64 bytes (2011)
Traditional databases are at their limits
Motivation
- Common approach to meet discussed requirements: RAM-Caches
- Must be synchronized with secondary storage
- Refilling the cache after a failure is very time-consuming (Facebook outage 2011 -> 2.5 h)
- Cache misses are expensive
- Another approach: Keeping all objects in RAM at all times
- RAMCloud:
- Table-based data model
- 64-bit global ID mapping via a hash table
- Log-structured memory design
- Optimized for large files
The In-Memory Storage DXRAM
The In-Memory Storage DXRAM
Overview
- DXRAM is a distributed in-memory system:
- Optimized to handle billions of small objects
- Key-value data model with name service
- Transparent backup to SSD (or HDD)
- Core Services:
- For management, storage and transfer of key-value tuples (chunks)
- Minimal interface
- Extended Data Services:
- General services and extended data models
The In-Memory Storage DXRAM
Chunks
- Variable sizes
- Every chunk is initially stored on its creator, but can be migrated (e.g., to resolve hot spots)
- Every chunk has a 64-bit globally unique chunk ID (CID), composed as shown in the sketch after this list
- First 16 bits: NodeID of the creator node
- Last 48 bits: Locally unique sequential ID
- Impact:
- Locality: Chunks created at the same location and adjacent in time have similar CIDs
- Initial location is stored in CID:
- No lookup needed if a chunk was not migrated
- After migration: New location must be stored elsewhere
- Applications cannot specify own IDs
- Migrated CIDs are stored in ranges in a B-tree on dedicated nodes
- No entry -> chunk is still stored on creator
- Support for user-defined keys:
- Name service with a Patricia-trie structure
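A minimal sketch of the CID composition described above, using plain Java bit operations (an illustration only, not DXRAM's actual code):

```java
// Hedged sketch: pack a 16-bit NodeID and a 48-bit locally unique sequential ID
// into one 64-bit chunk ID (CID), and extract both parts again.
public final class ChunkID {
    private static final int LOCAL_ID_BITS = 48;
    private static final long LOCAL_ID_MASK = (1L << LOCAL_ID_BITS) - 1;

    // Upper 16 bits: NodeID of the creator; lower 48 bits: LocalID.
    public static long create(short nodeID, long localID) {
        return (((long) nodeID & 0xFFFF) << LOCAL_ID_BITS) | (localID & LOCAL_ID_MASK);
    }

    public static short nodeID(long cid) {
        return (short) (cid >>> LOCAL_ID_BITS);
    }

    public static long localID(long cid) {
        return cid & LOCAL_ID_MASK;
    }
}
```

Because chunks created on the same node get sequential LocalIDs, their CIDs are numerically adjacent, which is the locality property mentioned above.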
The In-Memory Storage DXRAM
Global meta-data management
- Fast node lookup with a custom Chord-like super-peer overlay
- 8 to 10% of all nodes are super-peers
- Super-peers do not store data, only meta-data
- Meta-data is replicated on successors
- Every super-peer knows every other super-peer -> lookup with constant time complexity, O(1)
- Every peer is assigned to one super-peer
- Fast node recovery
- Super-peers also store backup locations
- Distributed failure detection
- Super-peer coordinated recovery with multiple peers
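As an illustration of the lookup path sketched on the last two slides, the following hedged sketch (hypothetical interfaces, not DXRAM's API; it reuses the ChunkID helper from the earlier sketch) asks the responsible super-peer's migration B-tree and falls back to the creator encoded in the CID if there is no entry:

```java
// Hedged sketch: resolve the node currently storing a chunk.
interface SuperPeer {
    // Range-based B-tree over migrated CIDs; returns -1 if there is no entry.
    short lookupMigratedLocation(long cid);
}

final class ChunkLocator {
    private final SuperPeer responsibleSuperPeer; // the super-peer this peer is assigned to

    ChunkLocator(SuperPeer superPeer) {
        this.responsibleSuperPeer = superPeer;
    }

    short locate(long cid) {
        short migratedTo = responsibleSuperPeer.lookupMigratedLocation(cid);
        if (migratedTo != -1) {
            return migratedTo;          // chunk was migrated, the overlay knows the new owner
        }
        return ChunkID.nodeID(cid);     // no entry -> chunk is still stored on its creator
    }
}
```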
Asynchronous Logging
Asynchronous Logging
SSD Utilization
- Characteristics of SSDs:
- SSDs write at least one page (4 KB); pages are clustered to be accessed in parallel
- SSDs cannot overwrite a single flash page in place; a whole block (64 to 128 pages) is erased and the data is written elsewhere
- It is faster to write sequentially than randomly on SSD
- Mixing write and read accesses slows the SSD down
- Life span: Limited number of program-erase cycles
- Consequences:
- Buffer write accesses
- Use a log to avoid deletions and to write sequentially
- Only read the log during recovery
- Two-level log organization: One primary log and one secondary log for every node requesting backups
- Idea: Store incoming backup requests as soon as possible on SSD to avoid data loss and, at the same time, write as much as possible at once
- No need to store meta-data in RAM, because every entry is self-describing
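A possible shape of such a self-describing entry (the field layout below is an assumption for illustration, not DXRAM's actual on-SSD format):

```java
import java.nio.ByteBuffer;

// Hedged sketch: each log entry carries its own CID, payload length and version,
// so no per-entry meta-data needs to be kept in RAM.
final class LogEntry {
    static void write(ByteBuffer log, long cid, int version, byte[] payload) {
        log.putLong(cid);            // 8 bytes: chunk ID
        log.putInt(payload.length);  // 4 bytes: payload length
        log.putInt(version);         // 4 bytes: version, used to detect outdated entries
        log.put(payload);            // the chunk data itself
    }
}
```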
Asynchronous Logging
Architecture
(Architecture figure: backup requests are gathered in the write buffer; on time-out or threshold they are sorted by NID and written to the primary log and the secondary logs)
Asynchronous Logging
Architecture
(Architecture figure: X producer network threads fill the write buffer in RAM; one consumer writer thread flushes it to the primary log on SSD on time-out or threshold)
Write buffer:
- The write buffer stores chunks from potentially every node and is therefore filled frequently
- Bundles backup requests (4 KB)
- Decouples network threads (sync possible)
- Parallel access to write buffer
Writer thread:
- Flushes the write buffer to the primary log after a time-out (e.g. 0.5 s) or if a threshold is reached (e.g. 16 MB); see the sketch below
- Two-bucket approach
Problem:
- To recover all data of one node, the whole primary log must be processed
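A minimal sketch of this producer-consumer scheme (class and constant names are assumptions, not DXRAM's implementation): many network threads append backup requests, and a single writer thread flushes the buffer when the threshold is reached or the time-out expires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of the write buffer: producers append, one consumer flushes.
final class WriteBuffer {
    private static final int THRESHOLD_BYTES = 16 * 1024 * 1024; // e.g. 16 MB
    private static final long TIMEOUT_MS = 500;                  // e.g. 0.5 s

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition flushNeeded = lock.newCondition();
    private List<byte[]> pending = new ArrayList<>();
    private int pendingBytes = 0;

    // Producer side: called by many network threads.
    void append(byte[] entry) {
        lock.lock();
        try {
            pending.add(entry);
            pendingBytes += entry.length;
            if (pendingBytes >= THRESHOLD_BYTES) {
                flushNeeded.signal(); // wake the writer thread before the time-out
            }
        } finally {
            lock.unlock();
        }
    }

    // Consumer side: the single writer thread loops here.
    void flushLoop(PrimaryLog primaryLog) throws InterruptedException {
        while (true) {
            List<byte[]> batch;
            lock.lock();
            try {
                if (pendingBytes < THRESHOLD_BYTES) {
                    flushNeeded.await(TIMEOUT_MS, TimeUnit.MILLISECONDS); // time-out path
                }
                batch = pending;                  // swap buffers ("two bucket" idea)
                pending = new ArrayList<>();
                pendingBytes = 0;
            } finally {
                lock.unlock();
            }
            if (!batch.isEmpty()) {
                primaryLog.write(batch);          // one large sequential write to SSD
            }
        }
    }

    interface PrimaryLog { void write(List<byte[]> batch); }
}
```

Swapping in a fresh list on every flush reflects the two-bucket idea: producers keep filling one bucket while the writer thread drains the other.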
Asynchronous Logging
Architecture
(Architecture figure: on time-out or threshold, the write buffer in RAM is flushed to the primary log and to per-node secondary log buffers 1..X, which in turn feed secondary logs 1..X on SSD)
Asynchronous Logging
Optimizations
- The write buffer is sorted by NID before writing to SSD
- If there is more than 4 KB of data for one node, it is written directly to the corresponding secondary log; see the sketch below
- Method: Combination of hashing and monitoring
- Clearing the primary log:
- Flush all secondary log buffers
- Set read pointer to write pointer
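A hedged sketch of this optimization (helper types are assumptions, not DXRAM's code): entries are grouped by NodeID, and any group larger than one SSD page bypasses the primary log and goes directly to that node's secondary log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch: sort the write buffer by NID and route large batches directly
// to the owning node's secondary log.
final class NidSorter {
    static final int PAGE_SIZE = 4096;

    static void flush(List<Entry> batch, Logs logs) {
        Map<Short, List<Entry>> byNid = new HashMap<>();
        Map<Short, Integer> bytesPerNid = new HashMap<>();
        for (Entry e : batch) {
            byNid.computeIfAbsent(e.nodeID, k -> new ArrayList<>()).add(e);
            bytesPerNid.merge(e.nodeID, e.data.length, Integer::sum);
        }
        for (Map.Entry<Short, List<Entry>> group : byNid.entrySet()) {
            if (bytesPerNid.get(group.getKey()) > PAGE_SIZE) {
                logs.writeToSecondaryLog(group.getKey(), group.getValue()); // large enough: skip the primary log
            } else {
                logs.writeToPrimaryLog(group.getValue());                   // small: keep in the primary log for now
            }
        }
    }

    static final class Entry { short nodeID; byte[] data; }
    interface Logs {
        void writeToPrimaryLog(List<Entry> entries);
        void writeToSecondaryLog(short nodeID, List<Entry> entries);
    }
}
```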
Fast Recovery
Fast Recovery
- Super-peer overlay:
- Fast and distributed failure detection (hierarchical heart beat protocol)
- Coordinated and purposeful peer recovery (the super-peer knows all corresponding backup locations)
- Recovery modes:
1. Every contacted peer recovers chunks locally (fastest, no data transfer)
2. All chunks are recovered and sent to one peer (1:1)
3. All chunks are recovered and sent to several peers (faster, but less locality; used by RAMCloud)
4. 1 and 2 combined: recover locally and rebuild the failed peer gradually
Reorganization
Reorganization
- Write buffers and primary log are cleared periodically
- Secondary logs are contiguously filled
- To free the space of deleted or outdated entries, the secondary logs have to be reorganized
- Every peer reorganizes its logs independently
- Demands:
- Space-efficiency
- As little disruption as possible
- Incremental operation to guarantee fast recovery
- Idea (inspired by LFS, the log-structured file system):
- Divide log into segments with fixed size
- Reorganize one segment after another
- Distinguish segments by access frequency (hot and cold zones)
- Decide which segment to reorganize by a cost-benefit ratio (see the sketch below)
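For illustration, a hedged sketch of cost-benefit segment selection using the classic LFS formula benefit/cost = (1 - u) * age / (1 + u), where u is the segment's live-data fraction; the outlook slide notes that DXRAM plans an adapted formula, so this shows only the baseline principle.

```java
// Hedged sketch: pick the next segment to reorganize by the classic LFS cost-benefit score.
final class SegmentSelector {
    static final class Segment {
        double utilization;   // u: live bytes / segment size, in [0, 1]
        long lastModified;    // used to derive the segment's age
    }

    static Segment pick(Segment[] segments, long now) {
        Segment best = null;
        double bestScore = -1;
        for (Segment s : segments) {
            double age = now - s.lastModified;                       // older data is "colder"
            double score = (1.0 - s.utilization) * age / (1.0 + s.utilization);
            if (score > bestScore) {
                bestScore = score;
                best = s;                                            // reorganize this segment next
            }
        }
        return best;
    }
}
```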
Conclusion
- Current status:
- DXRAM memory management tested on cluster with more than 5 billion objects
- Small object processing faster than RAMCloud
- Multi-threaded buffer implemented and examined under a worst-case scenario
- Logs fully functional with a less complex reorganization scheme
- Node failure detection and initialization of recovery process tested
- Outlook:
- Implementation of an LFS-like reorganization scheme with an adapted cost-benefit formula
- Replica placement (Copysets)
- Evaluation of complete recovery process
Backup Slides
The In-Memory Storage DXRAM
In-memory data management
- Paging-like translation to local addresses instead of a hash table (see the sketch below)
- Space-efficient and fast
- Minimized internal fragmentation
- Small overhead: only 7 bytes per chunk
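A hedged sketch of what such a paging-like translation could look like (the two-level split and bit widths are assumptions; the slide does not specify DXRAM's actual layout):

```java
// Hedged sketch: translate a LocalID to a local memory address via a page-table-like
// two-level table instead of a hash table. Simplified: two 12-bit levels cover only
// the lower 24 bits of the 48-bit LocalID.
final class AddressTranslation {
    private static final int LEVEL_BITS = 12;                  // assumed: 12 bits per level
    private static final int ENTRIES = 1 << LEVEL_BITS;
    private static final int MASK = ENTRIES - 1;

    private final long[][] directory = new long[ENTRIES][];    // first level; second level allocated lazily

    void put(long localID, long address) {
        int first = (int) (localID >>> LEVEL_BITS) & MASK;
        int second = (int) localID & MASK;
        if (directory[first] == null) {
            directory[first] = new long[ENTRIES];
        }
        directory[first][second] = address;
    }

    long get(long localID) {
        int first = (int) (localID >>> LEVEL_BITS) & MASK;
        long[] table = directory[first];
        return table == null ? 0 : table[(int) localID & MASK];
    }
}
```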