SLIDE 1
BabuDB: Fast and Efficient File System Metadata Storage
Jan Stender, Björn Kolbeck, Felix Hupfeld, Mikael Högqvist
Zuse Institute Berlin · Google GmbH Zurich
SNAPI 2010 · Jan Stender
Motivation
– Modern parallel / distributed file systems:
SLIDE 2
SLIDE 3
SNAPI 2010 · Jan Stender
Motivation
– B-tree-like data structures used for metadata storage
  – ZFS, btrfs, Lustre, PVFS2
– Downsides:
  – Hard to implement and test, high code complexity
  – Multi-version B-trees even more complex
  – On-disk re-balancing expensive
SLIDE 4
BabuDB
– Key-value store
– FS metadata: key-value pairs stored in DB indices
SLIDE 5
BabuDB: Index
SLIDE 6
Example
SLIDE 7
Example: Insertions
SLIDE 8
Example: Insertions
SLIDE 9
Example: Lookups
SLIDE 10
Example: Lookups
SLIDE 11
Example: Lookups
SLIDE 12
Example: Lookups
SLIDE 13
Example: Deletions
SLIDE 14
Example: Deletions
SLIDE 15
Example: Deletions
SLIDE 16
Example: Deletions
SLIDE 17
Example: Range Lookups
SLIDE 18
Example: Range Lookups
SLIDE 19
Example: Range Lookups
SLIDE 20
Example: Range Lookups
SLIDE 21
Example: Checkpoints
SLIDE 22
Example: Checkpoints
SLIDE 23
Example: Checkpoints
SLIDE 24
Example: Checkpoints
SLIDE 25
On-disk Index
– Sorted by keys
– Block index in RAM, blocks mmap'ed
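A minimal sketch of how a lookup against such an index might proceed: binary search in the in-RAM block index to pick one block, then binary search inside that block. This is a toy model under stated assumptions, not BabuDB's actual code. The class and method names are hypothetical, and plain Python lists stand in for the mmap'ed blocks.

```python
import bisect


class BlockIndexedFile:
    """Toy model of a sorted, block-structured index (all names hypothetical)."""

    def __init__(self, sorted_items, block_size=4):
        # Split the sorted (key, value) pairs into fixed-size blocks.
        # In a real system each block would be a region of an mmap'ed file.
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # The in-RAM block index keeps only the first key of every block.
        self.first_keys = [block[0][0] for block in self.blocks]

    def lookup(self, key):
        # Step 1: find the last block whose first key is <= key.
        pos = bisect.bisect_right(self.first_keys, key) - 1
        if pos < 0:
            return None  # key sorts before everything in the index
        # Step 2: binary search inside the selected block.
        block = self.blocks[pos]
        keys = [k for k, _ in block]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return block[i][1]
        return None


items = sorted((f"file{i:03d}", i) for i in range(10))
index = BlockIndexedFile(items, block_size=4)
print(index.lookup("file007"))
```

Only the first key of each block lives in RAM, so memory use grows with the number of blocks rather than the number of keys, and a lookup touches a single block.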
SLIDE 26
BabuDB: Related Work
– Inspired by log-structured merge trees (LSM-trees)
  – Only one on-disk index
  – No "rolling merge"
– Made popular by Google Bigtable
  – Insert/lookup/merge similar to Bigtable's tablets
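The operations walked through in the example slides (insertions, lookups, deletions, range lookups, checkpoints) can be sketched for a design with one in-memory overlay and a single on-disk index. This is an illustration only: the class name is hypothetical, a dict stands in for the on-disk index, and recording deletions as marker values in the overlay is an assumption borrowed from Bigtable-style designs rather than something the slides state.

```python
TOMBSTONE = object()  # assumed deletion marker kept in the in-memory overlay


class TinyLsmIndex:
    """Toy index: one in-memory overlay over a single 'on-disk' index."""

    def __init__(self):
        self.overlay = {}  # recent inserts and deletes, held in memory
        self.base = {}     # stands in for the single immutable on-disk index

    def insert(self, key, value):
        self.overlay[key] = value

    def delete(self, key):
        # The base index is never modified in place; the overlay masks the key.
        self.overlay[key] = TOMBSTONE

    def lookup(self, key):
        # The overlay takes precedence over the base index.
        if key in self.overlay:
            value = self.overlay[key]
            return None if value is TOMBSTONE else value
        return self.base.get(key)

    def range_lookup(self, start, end):
        # Merge both levels for start <= key < end, then drop deleted keys.
        merged = {k: v for k, v in self.base.items() if start <= k < end}
        for k, v in self.overlay.items():
            if start <= k < end:
                merged[k] = v
        pairs = sorted(merged.items(), key=lambda kv: kv[0])
        return [(k, v) for k, v in pairs if v is not TOMBSTONE]

    def checkpoint(self):
        # Write a new base index from base + overlay, then reset the overlay.
        merged = dict(self.base)
        merged.update(self.overlay)
        self.base = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.overlay = {}


db = TinyLsmIndex()
db.insert("a", 1)
db.insert("b", 2)
db.checkpoint()
db.delete("a")
print(db.lookup("a"), db.lookup("b"))
```

Because a checkpoint replaces the whole base index in one step, this sketch needs neither multiple on-disk levels nor an incremental merge between them.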
SLIDE 27
BabuDB: Metadata Mapping
– Mapping a hierarchical directory tree to a flat database index:
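The slide's figure is not reproduced in this transcript. As one possible illustration of such a mapping, the sketch below assumes each entry is stored under the key (parent directory ID, file name). That key layout, the class name and the metadata fields are all hypothetical and may differ from what BabuDB and XtreemFS actually use. With sorted keys, all entries of one directory are adjacent, so listing a directory becomes a single range scan.

```python
import bisect


class FlatMetadataIndex:
    """Toy flat index keyed by (parent_id, name); the layout is an assumption."""

    def __init__(self):
        self.keys = []    # sorted list of (parent_id, name) keys
        self.values = {}  # key -> metadata dict

    def put(self, parent_id, name, metadata):
        key = (parent_id, name)
        if key not in self.values:
            bisect.insort(self.keys, key)  # keep keys in sorted order
        self.values[key] = metadata

    def stat(self, parent_id, name):
        # Point lookup of a single directory entry.
        return self.values.get((parent_id, name))

    def readdir(self, parent_id):
        # All keys with this parent_id form one contiguous run of the
        # sorted key list, so one range scan returns the whole directory.
        lo = bisect.bisect_left(self.keys, (parent_id, ""))
        hi = bisect.bisect_left(self.keys, (parent_id + 1, ""))
        return [(name, self.values[(pid, name)])
                for pid, name in self.keys[lo:hi]]


fs = FlatMetadataIndex()
fs.put(1, "etc", {"id": 2, "type": "dir"})
fs.put(1, "home", {"id": 3, "type": "dir"})
fs.put(2, "passwd", {"id": 4, "type": "file"})
print([name for name, _ in fs.readdir(1)])
```

Since the scan returns the metadata together with the names, a directory listing followed by stat on each entry needs no further lookups in this model.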
SLIDE 28
BabuDB: Advantages
– Why BabuDB for file system metadata?
  – Short-lived files
    ▪ 50% of all files deleted within 5 minutes
  – Atomic file system operations w/o locking or transactions
    ▪ e.g. rename
  – Directory content in contiguous disk regions
    ▪ Efficient readdir + stat
  – Snapshots
    ▪ No need for multi-version data structures
SLIDE 29
BabuDB: Evaluation
– Linux kernel build
  – ~10M calls: 44% stat, 40% open, 15% readlink, 1% others
– Dovecot mail server + imaptest
  – ~2M calls: 51% stat, 48% open, 1% others
[Figure: two bar charts of runtime in seconds, BabuDB vs. ext4, one for the Dovecot test and one for the kernel build]
SLIDE 30
BabuDB: Evaluation
– Listing directory content
SLIDE 31
Summary
– BabuDB is ...
  – an efficient key-value store
  – optimized for file system metadata but also suitable for other purposes
  – suitable for large-scale databases
  – available for Java and C++ under BSD license
  – used in the XtreemFS metadata server
http://babudb.googlecode.com
http://www.xtreemfs.org
SLIDE 32
Thank you for your attention!
SLIDE 33
Background: XtreemFS
– XtreemFS: a distributed replicated Internet file system
  – part of the XtreemOS research project
  – developed since 2006 by partners from Germany, Spain and Italy
www.xtreemfs.org
– Object-based architecture:
  – MRC stores metadata
  – OSDs store pure file content as objects
  – Clients provide POSIX file system interface
SLIDE 34
The XtreemOS Project
– Research project funded by the European Commission
– 19 partners from Europe and China
– XtreemFS is the data management component
  – developed by ZIB, NEC HPC Europe, Barcelona Supercomputing Center and ICAR-CNR Italy
  – ~3 years of development
  – first public release in August 2008
SLIDE 35
XtreemFS: Overview
– What is XtreemFS?
  – a distributed and replicated POSIX-compliant file system
  – off-the-shelf servers, no expensive hardware
  – servers in Java, runs on Linux / OS X / Solaris
  – client in C, runs on Linux / OS X / Windows
  – secure (X.509 and SSL)
  – easy to install and maintain
  – open source (GPL)
SLIDE 36