SLIDE 1

Asynchronous Logging and Fast Recovery for a Large-Scale Distributed In-Memory Storage

Kevin Beineke, Florian Klein, Michael Schöttner Institut für Informatik, Heinrich-Heine-Universität Düsseldorf

SLIDE 2

Outline

  • Motivation
  • The In-Memory Storage DXRAM
  • Asynchronous Logging
  • Fast Recovery
  • Reorganization
  • Conclusion
SLIDE 3

Motivation

  • Large-scale interactive applications and online graph computations:

  • Billions of small data objects
  • Dynamically expanding
  • Read accesses dominate over write accesses
  • Low latency required
  • Example: Facebook
  • More than one billion users
  • More than 150 TB of data (2011)
  • 70% of all data objects are smaller than 64 bytes (2011)


Traditional databases are at their limits

SLIDE 4

Motivation

  • Common approach to meet the discussed requirements: RAM caches
  • Must be synchronized with secondary storage
  • Refilling after a failure is very time consuming (Facebook outage 2011 -> 2.5 h)
  • Cache misses are expensive
  • Another approach: Keeping all objects always in RAM
  • RAMCloud:
  • Table-based data model
  • 64-bit global ID mapping via a hash table
  • Log-structured memory design
  • Optimized for large files


SLIDE 5

The In-Memory Storage DXRAM

SLIDE 6

The In-Memory Storage DXRAM

Overview

  • DXRAM is a distributed in-memory system:
  • Optimized to handle billions of small objects
  • Key-value data model with name service
  • Transparent backup to SSD (or HDD)
  • Core Services:
  • For management, storage, and transfer of key-value tuples (chunks)
  • Minimal interface
  • Extended Data Services:
  • General services and extended data models


SLIDE 7

The In-Memory Storage DXRAM

Chunks


  • Variable sizes
  • Every chunk is initially stored on the creator, but can be migrated (e.g. to resolve hot spots)
  • Every chunk has a 64-bit globally unique chunk ID (CID), composed as sketched below
  • First 16 bits: NodeID (NID) of the creator node
  • Last 48 bits: Locally unique sequential ID (LocalID)
  • Impact:
  • Locality: Chunks created on the same node close together in time have similar CIDs
  • Initial location is encoded in the CID:
  • No lookup needed if the chunk was not migrated
  • After migration: New location must be stored elsewhere
  • Applications cannot specify their own IDs

[Diagram: CID layout, 16-bit NID followed by 48-bit LocalID]

  • Migrated CIDs are stored as ranges in a B-tree on dedicated nodes
  • No entry -> chunk is still stored on the creator
  • Support for user-defined keys:
  • Name service with a Patricia trie structure
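
For illustration, a minimal sketch of how a CID with this layout can be packed and unpacked with bit operations (class and method names are hypothetical, not DXRAM's actual API):

```java
// Sketch: composing/decomposing a 64-bit chunk ID (CID) from a
// 16-bit creator NodeID (NID) and a 48-bit sequential LocalID.
// Names are illustrative; not DXRAM's actual API.
public final class ChunkID {
    private static final int LID_BITS = 48;
    private static final long LID_MASK = (1L << LID_BITS) - 1;

    // Pack the NID (upper 16 bits) and LocalID (lower 48 bits) into one CID.
    public static long create(short nid, long localId) {
        return ((long) nid << LID_BITS) | (localId & LID_MASK);
    }

    // The creator node is recoverable from the CID without any lookup.
    public static short creatorNid(long cid) {
        return (short) (cid >>> LID_BITS);
    }

    public static long localId(long cid) {
        return cid & LID_MASK;
    }
}
```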
SLIDE 8

The In-Memory Storage DXRAM

Global meta-data management

  • Fast node lookup with a custom Chord-like super-peer overlay
  • 8 to 10% of all nodes are super-peers
  • Super-peers do not store data but meta-data
  • Meta-data is replicated on successors
  • Every super-peer knows every other super-peer -> lookup with constant time complexity O(1) (sketched below)

  • Every peer is assigned to one super-peer
  • Fast node recovery
  • Super-peers also store backup locations
  • Distributed failure detection
  • Super-peer coordinated recovery with multiple peers
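
Why the lookup is constant time: every super-peer holds the complete, sorted super-peer list locally, so finding the super-peer responsible for a NodeID needs no multi-hop routing and is independent of the total number of peers. A hedged sketch, assuming Chord-like successor semantics (names are illustrative, not DXRAM's overlay code):

```java
import java.util.Arrays;

// Sketch: Chord-like responsibility lookup over a fully known,
// sorted list of super-peer IDs. The list is complete and local,
// so the lookup is O(1) with respect to the number of peers.
public final class SuperPeerOverlay {
    private final int[] superPeers; // sorted 16-bit NodeIDs (as ints)

    public SuperPeerOverlay(int[] sortedSuperPeerIds) {
        this.superPeers = sortedSuperPeerIds.clone();
    }

    // The responsible super-peer is the successor of the peer's ID
    // on the ring (wrapping around past the largest ID).
    public int responsibleSuperPeer(int peerId) {
        int pos = Arrays.binarySearch(superPeers, peerId);
        if (pos < 0) pos = -pos - 1;                 // insertion point
        return superPeers[pos % superPeers.length]; // wrap around the ring
    }
}
```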

SLIDE 9

Asynchronous Logging

SLIDE 10

Asynchronous Logging

SSD Utilization

  • Characteristics of SSDs:
  • SSDs write at least one page (4 KB); pages are clustered to be accessed in parallel
  • SSDs cannot overwrite a single flash page, but must erase a whole block (64 to 128 pages) and write elsewhere
  • It is faster to write sequentially than randomly on an SSD
  • Mixing write and read accesses slows the SSD down
  • Life span: Limited number of program-erase cycles
  • Consequences:
  • Buffer write accesses (see the sketch after this list)
  • Use a log to avoid deletions and to write sequentially
  • Only read the log during recovery
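
These consequences suggest a write path like the following sketch: small entries are buffered in RAM and flushed as page-aligned sequential appends. This is a minimal illustration under stated assumptions (file path, buffer size, padding policy), not DXRAM's implementation:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch: append-only log that batches small writes and flushes
// whole 4 KB pages sequentially, matching the SSD constraints above.
// Buffer size and padding policy are assumptions for illustration.
public final class AppendOnlyLog {
    private static final int PAGE = 4096;
    private final ByteBuffer buffer = ByteBuffer.allocate(16 * 1024 * 1024);
    private final RandomAccessFile log;

    public AppendOnlyLog(String path) throws IOException {
        log = new RandomAccessFile(path, "rw");
        log.seek(log.length()); // always append, never overwrite in place
    }

    public synchronized void append(byte[] entry) throws IOException {
        if (buffer.remaining() < entry.length) flush();
        buffer.put(entry);
    }

    // Flush in multiples of the flash page size, at a sequential position.
    // Zeroed padding can be skipped at recovery since entries are self-describing.
    public synchronized void flush() throws IOException {
        int len = buffer.position();
        int padded = ((len + PAGE - 1) / PAGE) * PAGE;      // round up to a page
        Arrays.fill(buffer.array(), len, padded, (byte) 0); // deterministic padding
        log.write(buffer.array(), 0, padded);
        buffer.clear();
    }
}
```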


SLIDE 11

Asynchronous Logging

Architecture

  • Two-level log organization: One primary log and one secondary log for every node requesting backups
  • Idea: Store incoming backup requests on SSD as soon as possible to avoid data loss, while at the same time writing as much as possible at once
  • No need to store meta-data in RAM, because every entry is self-describing

[Diagram: backup requests fill the write buffer; on time-out or threshold it is flushed to the primary log, then sorted by NID into per-node secondary logs]

SLIDE 12

Asynchronous Logging

Architecture

[Diagram: X producer network threads fill the write buffer in RAM; one consumer writer thread flushes it to the primary log on SSD on time-out or threshold]

Write buffer:

  • The write buffer stores chunks from potentially every node: It is filled frequently
  • Bundles backup requests (4 KB)
  • Decouples network threads (synchronous operation possible)
  • Parallel access to the write buffer

Writer thread:

  • Flushes the write buffer to the primary log after a time-out (e.g. 0.5 s) or if a threshold is reached (e.g. 16 MB), as sketched below
  • Two-bucket approach
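
A hedged sketch of the time-out-or-threshold flush policy described above (the 0.5 s and 16 MB values are from the slide; the queue-based structure is an illustration, not DXRAM's two-bucket buffer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: one consumer thread drains backup requests produced by the
// network threads and flushes when 16 MB have accumulated or 0.5 s
// have passed, whichever comes first.
public final class WriterThread implements Runnable {
    private static final long TIMEOUT_MS = 500;             // 0.5 s
    private static final int THRESHOLD = 16 * 1024 * 1024;  // 16 MB

    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1 << 16);

    // Producer side: called by the network threads.
    public void offer(byte[] entry) throws InterruptedException {
        queue.put(entry);
    }

    @Override
    public void run() {
        List<byte[]> pending = new ArrayList<>();
        int bufferedBytes = 0;
        long deadline = System.currentTimeMillis() + TIMEOUT_MS;
        while (!Thread.currentThread().isInterrupted()) {
            try {
                long wait = Math.max(1, deadline - System.currentTimeMillis());
                byte[] entry = queue.poll(wait, TimeUnit.MILLISECONDS);
                if (entry != null) {
                    pending.add(entry);
                    bufferedBytes += entry.length;
                }
                if (bufferedBytes >= THRESHOLD || System.currentTimeMillis() >= deadline) {
                    flushToPrimaryLog(pending);  // sequential append, see earlier sketch
                    pending.clear();
                    bufferedBytes = 0;
                    deadline = System.currentTimeMillis() + TIMEOUT_MS;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down cleanly
            }
        }
    }

    private void flushToPrimaryLog(List<byte[]> entries) { /* write to SSD */ }
}
```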

Problem:

  • To recover all data from one node, the whole primary log must be processed

SLIDE 13

Asynchronous Logging

Architecture

[Diagram: the write buffer in RAM is flushed on time-out or threshold into per-node secondary log buffers (Sec. Log Buffer 1 to X), which in turn are written to Secondary Logs 1 to X on SSD, alongside the primary log]

SLIDE 14

Asynchronous Logging

Optimizations

  • The write buffer is sorted by NID before writing to SSD
  • If there is more than 4 KB for one node, the data is written directly to the corresponding secondary log (see the sketch after this list)
  • Method: Combination of hashing and monitoring
  • Clearing the primary log:
  • Flush all secondary log buffers
  • Set the read pointer to the write pointer
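
A minimal sketch of this sort-and-route step (the 4 KB threshold is from the slide; the data structures and names are assumptions, not DXRAM's implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: group buffered entries by the creator's NodeID; nodes with
// at least 4 KB of data bypass the primary log and are written
// straight to their secondary log.
public final class FlushRouter {
    private static final int DIRECT_WRITE_THRESHOLD = 4096; // 4 KB

    interface LogEntry { short nid(); int size(); }

    public void route(List<LogEntry> writeBuffer) {
        // "Sort by NID": tally bytes per creator node.
        Map<Short, Integer> bytesPerNid = new HashMap<>();
        for (LogEntry e : writeBuffer)
            bytesPerNid.merge(e.nid(), e.size(), Integer::sum);

        for (LogEntry e : writeBuffer) {
            if (bytesPerNid.get(e.nid()) >= DIRECT_WRITE_THRESHOLD)
                appendToSecondaryLog(e.nid(), e); // enough data: skip primary log
            else
                appendToPrimaryLog(e);            // small remainder: primary log
        }
    }

    private void appendToSecondaryLog(short nid, LogEntry e) { /* ... */ }
    private void appendToPrimaryLog(LogEntry e) { /* ... */ }
}
```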


SLIDE 15

Fast Recovery

SLIDE 16

Fast Recovery

  • Super-peer overlay:
  • Fast and distributed failure detection (hierarchical heartbeat protocol)
  • Coordinated and purposeful peer recovery (the super-peer knows all corresponding backup locations)

  • Recovery modes:

1. Every contacted peer recovers chunks locally (fastest, no data transfer)
2. All chunks are recovered and sent to one peer (1:1)
3. All chunks are recovered and sent to several peers (faster, but less locality; used by RAMCloud)
4. 1 and 2 combined: Recover locally and rebuild the failed peer gradually


SLIDE 17

Reorganization

SLIDE 18

Reorganization

  • Write buffers and the primary log are cleared periodically
  • Secondary logs are contiguously filled
  • To free the space of deleted or outdated entries, the secondary logs have to be reorganized
  • Every peer reorganizes its logs independently
  • Demands:
  • Space efficiency
  • As little disruption as possible
  • Incremental operation to guarantee fast recovery
  • Idea (inspired by LFS):
  • Divide the log into fixed-size segments
  • Reorganize one segment after another
  • Distinguish segments by access frequency (hot and cold zones)
  • Decide which segment to reorganize by a cost-benefit ratio (see the formula after this list)
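
As a reference point, the classic LFS cost-benefit formula by Rosenblum and Ousterhout, which the adapted scheme mentioned in the conclusion would start from (u is the fraction of live data in a segment; cleaning costs reading the whole segment plus rewriting its live data):

benefit / cost = ((1 - u) · age) / (1 + u)

Segments with little live data and old (cold) contents score highest and are reorganized first.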


SLIDE 19

Conclusion

  • Current status:
  • DXRAM memory management tested on a cluster with more than 5 billion objects
  • Small-object processing faster than RAMCloud
  • Multithreaded buffer implemented and examined under a worst-case scenario
  • Logs fully functional with a less complex reorganization scheme
  • Node failure detection and initialization of the recovery process tested
  • Outlook:
  • Implementation of an LFS-like reorganization scheme with an adapted cost-benefit formula
  • Replica placement (Copysets)
  • Evaluation of the complete recovery process


SLIDE 20

Backup Slides

SLIDE 21

The In-Memory Storage DXRAM


In-memory data management

  • Paging-like translation to local addresses instead of a hash table (sketched below)
  • Space-efficient and fast
  • Minimized internal fragmentation
  • Small overhead: Only 7 bytes for chunks smaller than 256 bytes
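
To illustrate the paging-like translation, a hedged sketch of a multi-level table indexed by LocalID bits, analogous to a page table. The 4 x 12-bit split and the class name are assumptions for illustration, not DXRAM's actual layout; the point is that sequential LocalIDs keep the tables densely filled, avoiding a per-chunk hash-table entry:

```java
// Sketch: translate a 48-bit LocalID to a local memory address via
// four 12-bit index levels, like a multi-level page table.
public final class CIDTable {
    private static final int BITS = 12, SIZE = 1 << BITS, MASK = SIZE - 1;

    // Three directory levels of Object[] and a leaf level of long[].
    private final Object[] level1 = new Object[SIZE];

    public void set(long lid, long address) {
        Object[] l2 = descend(level1, (int) (lid >>> 36) & MASK);
        Object[] l3 = descend(l2, (int) (lid >>> 24) & MASK);
        int leafIdx = (int) (lid >>> 12) & MASK;
        if (l3[leafIdx] == null) l3[leafIdx] = new long[SIZE];
        ((long[]) l3[leafIdx])[(int) lid & MASK] = address;
    }

    public long get(long lid) {
        Object[] l2 = (Object[]) level1[(int) (lid >>> 36) & MASK];
        if (l2 == null) return 0;
        Object[] l3 = (Object[]) l2[(int) (lid >>> 24) & MASK];
        if (l3 == null) return 0;
        long[] leaf = (long[]) l3[(int) (lid >>> 12) & MASK];
        return leaf == null ? 0 : leaf[(int) lid & MASK];
    }

    // Allocate the next-level directory on demand.
    private static Object[] descend(Object[] table, int idx) {
        if (table[idx] == null) table[idx] = new Object[SIZE];
        return (Object[]) table[idx];
    }
}
```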