SLIDE 1

SNAPI 2010 · Jan Stender

BabuDB: Fast and Efficient File System Metadata Storage

Jan Stender, Björn Kolbeck, Mikael Högqvist

Zuse Institute Berlin

Felix Hupfeld

Google GmbH Zurich

SLIDE 2

Motivation

– Modern parallel / distributed file systems:
  • Huge numbers of files and directories
  • Many storage servers but few metadata servers
– Examples:
  • Lustre, Panasas ActiveScale, Google File System
– Metadata access is critical to system performance
  • ~75% of all file system calls are metadata accesses
  • Metadata servers are bottlenecks

SLIDE 3

Motivation

– B-tree-like data structures used for metadata storage
  • ZFS, btrfs, Lustre, PVFS2
– Downsides:
  • Hard to implement and test, high code complexity
  • Multi-version B-trees even more complex
  • On-disk re-balancing expensive

SLIDE 4

BabuDB

– Key-value store
– FS metadata: key-value pairs stored in DB indices

SLIDE 5

BabuDB: Index

SLIDE 6

Example

SLIDE 7

Example: Insertions

SLIDE 8

Example: Insertions

SLIDE 9

Example: Lookups

SLIDE 10

Example: Lookups

SLIDE 11

Example: Lookups

SLIDE 12

Example: Lookups

SLIDE 13

Example: Deletions

SLIDE 14

Example: Deletions

SLIDE 15

Example: Deletions

SLIDE 16

Example: Deletions

SLIDE 17

Example: Range Lookups

SLIDE 18

Example: Range Lookups

SLIDE 19

Example: Range Lookups

SLIDE 20

Example: Range Lookups

SLIDE 21

Example: Checkpoints

SLIDE 22

Example: Checkpoints

SLIDE 23

Example: Checkpoints

SLIDE 24

Example: Checkpoints

SLIDE 25

On-disk Index

– Sorted by keys
– Block index in RAM, blocks mmap'ed
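The on-disk layout described on this slide can be sketched as follows. This is only an illustration, not BabuDB's actual file format: the block size, the helper names `build_blocks` and `lookup`, and the use of Python lists in place of mmap'ed file regions are all assumptions made for the example. The in-RAM block index holds just the first key of each block, so a lookup is one binary search in memory followed by a search inside a single block.

```python
from bisect import bisect_right

def build_blocks(sorted_items, block_size=4):
    """Split key-sorted (key, value) pairs into fixed-size blocks.

    Returns (block_index, blocks). block_index holds the first key of each
    block and would stay in RAM; blocks stand in for mmap'ed file regions.
    """
    blocks = [sorted_items[i:i + block_size]
              for i in range(0, len(sorted_items), block_size)]
    block_index = [block[0][0] for block in blocks]
    return block_index, blocks

def lookup(block_index, blocks, key):
    """Binary search on the block index, then inside the one candidate block."""
    pos = bisect_right(block_index, key) - 1  # last block whose first key <= key
    if pos < 0:
        return None  # key sorts before the first block
    block = blocks[pos]
    keys = [k for k, _ in block]
    i = bisect_right(keys, key) - 1
    if i >= 0 and keys[i] == key:
        return block[i][1]
    return None

# Hypothetical sample data: ten keys split into blocks of four entries.
items = sorted((f"file{n:02d}", n) for n in range(10))
block_index, blocks = build_blocks(items)
```

Keeping only one key per block in memory is the point of the design: the index stays small even for large files, and at most one block has to be touched per lookup.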

SLIDE 26

BabuDB: Related Work

– Inspired by log-structured merge trees (LSM-trees)
  • Only one on-disk index
  • No "rolling merge"
– Made popular by Google Bigtable
  • Insert/lookup/merge similar to Bigtable's tablets
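To make the comparison with LSM-trees concrete, here is a minimal sketch of an index that keeps one immutable sorted base (standing in for the single on-disk index) plus an in-memory overlay that absorbs all changes. The class name `OverlayIndex`, the `TOMBSTONE` marker and the method names are hypothetical and not taken from the BabuDB API. Persistence and logging are left out, and a plain dict replaces the sorted in-memory structure a real implementation would likely use.

```python
from bisect import bisect_left

TOMBSTONE = object()  # marks a deleted key in the overlay (illustrative)

class OverlayIndex:
    def __init__(self):
        self.base_keys = []  # sorted keys of the immutable "on-disk" index
        self.base_vals = []
        self.overlay = {}    # in-memory changes since the last checkpoint

    def insert(self, key, value):
        self.overlay[key] = value

    def delete(self, key):
        # The base is never modified in place; a tombstone hides the old entry.
        self.overlay[key] = TOMBSTONE

    def lookup(self, key):
        if key in self.overlay:
            value = self.overlay[key]
            return None if value is TOMBSTONE else value
        i = bisect_left(self.base_keys, key)
        if i < len(self.base_keys) and self.base_keys[i] == key:
            return self.base_vals[i]
        return None

    def range_lookup(self, lo, hi):
        """Return sorted (key, value) pairs with lo <= key < hi."""
        start = bisect_left(self.base_keys, lo)
        end = bisect_left(self.base_keys, hi)
        merged = dict(zip(self.base_keys[start:end], self.base_vals[start:end]))
        for key, value in self.overlay.items():
            if lo <= key < hi:
                merged[key] = value  # overlay entries win over the base
        live = [(k, v) for k, v in merged.items() if v is not TOMBSTONE]
        return sorted(live, key=lambda kv: kv[0])

    def checkpoint(self):
        """Merge base and overlay into a new immutable base; clear the overlay."""
        merged = dict(zip(self.base_keys, self.base_vals))
        merged.update(self.overlay)
        live = sorted(((k, v) for k, v in merged.items() if v is not TOMBSTONE),
                      key=lambda kv: kv[0])
        self.base_keys = [k for k, _ in live]
        self.base_vals = [v for _, v in live]
        self.overlay = {}
```

In this sketch the checkpoint rewrites the whole base in one pass, which matches the "no rolling merge" point above: there is a single merge target rather than a cascade of on-disk components.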

SLIDE 27

BabuDB: Metadata Mapping

– Mapping a hierarchical directory tree to a flat database index:
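The diagram for this slide did not survive in the transcript, so the following is a sketch of one plausible mapping rather than the scheme shown in the talk. It assumes keys of the form (parent directory ID, file name) in a sorted index; the sample entries, `ROOT_ID` and the functions `stat`, `readdir` and `resolve` are hypothetical. With such keys, all entries of one directory are adjacent in key order, so listing a directory becomes a single range scan.

```python
from bisect import bisect_left

# Hypothetical flat index: keys are (parent_dir_id, name), kept sorted.
entries = sorted([
    ((1, "etc"),    {"id": 2, "type": "dir"}),
    ((1, "home"),   {"id": 3, "type": "dir"}),
    ((2, "passwd"), {"id": 4, "type": "file"}),
    ((3, "alice"),  {"id": 5, "type": "dir"}),
    ((3, "bob"),    {"id": 6, "type": "dir"}),
    ((5, "notes"),  {"id": 7, "type": "file"}),
], key=lambda e: e[0])
keys = [k for k, _ in entries]
ROOT_ID = 1

def stat(parent_id, name):
    """Point lookup of one directory entry."""
    i = bisect_left(keys, (parent_id, name))
    if i < len(keys) and keys[i] == (parent_id, name):
        return entries[i][1]
    return None

def readdir(dir_id):
    """All entries of one directory form a contiguous key range."""
    start = bisect_left(keys, (dir_id, ""))
    end = bisect_left(keys, (dir_id + 1, ""))
    return [name for (_, name) in keys[start:end]]

def resolve(path):
    """Walk a path such as '/home/alice/notes' one component at a time."""
    current = ROOT_ID
    meta = {"id": ROOT_ID, "type": "dir"}
    for name in filter(None, path.split("/")):
        meta = stat(current, name)
        if meta is None:
            return None
        current = meta["id"]
    return meta
```

For example, `readdir(3)` scans only the two adjacent keys under directory 3, and `resolve("/home/alice/notes")` performs one point lookup per path component.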

SLIDE 28

BabuDB: Advantages

– Why BabuDB for File System Metadata?
  • Short-lived files
    ▪ 50% of all files deleted within 5 minutes
  • Atomic file system operations w/o locking or transactions
    ▪ e.g. rename
  • Directory content in contiguous disk regions
    ▪ Efficient readdir + stat
  • Snapshots
    ▪ No need for multi-version data structures
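One way to see why snapshots need no multi-version data structure: if changes only ever go into an in-memory layer on top of immutable data, a snapshot can be taken by freezing the current layer and directing new writes to a fresh one. The sketch below illustrates that idea only and is not BabuDB's implementation; `SnapshotIndex` and its methods are hypothetical names, and deletions and the on-disk index are omitted.

```python
class SnapshotIndex:
    def __init__(self):
        self.layers = [{}]  # oldest first; only the last layer is writable

    def insert(self, key, value):
        self.layers[-1][key] = value

    def snapshot(self):
        """Freeze the current layer and return a handle to the frozen state."""
        frozen_depth = len(self.layers)
        self.layers.append({})  # later writes go to a fresh layer
        return frozen_depth

    def lookup(self, key, snapshot=None):
        # Search newest to oldest, ignoring layers written after the snapshot.
        depth = len(self.layers) if snapshot is None else snapshot
        for layer in reversed(self.layers[:depth]):
            if key in layer:
                return layer[key]
        return None
```

A reader holding a snapshot handle keeps seeing the frozen layers, while current lookups also consult the newer ones; no entry ever carries more than one version inside a single layer.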

SLIDE 29

BabuDB: Evaluation

– Linux kernel build
  • ~10M calls: 44% stat, 40% open, 15% readlink, 1% others
– Dovecot mail server + imaptest
  • ~2M calls: 51% stat, 48% open, 1% others

[Bar charts: runtime in seconds, BabuDB vs. ext4. Dovecot test (axis up to 400 s); kernel build (axis up to 2000 s).]

SLIDE 30

BabuDB: Evaluation

– Listing directory content

SLIDE 31

Summary

– BabuDB is ...
  • an efficient key-value store
  • optimized for file system metadata but also suitable for other purposes
  • suitable for large-scale databases
  • available for Java and C++ under BSD license
  • used in the XtreemFS metadata server

http://babudb.googlecode.com
http://www.xtreemfs.org

SLIDE 32

Thank you for your attention!

SLIDE 33

Background: XtreemFS

– XtreemFS: a distributed replicated Internet file system
  • part of the XtreemOS research project
  • developed since 2006 by partners from Germany, Spain and Italy
  • www.xtreemfs.org
– Object-based architecture:
  • MRC stores metadata
  • OSDs store pure file content as objects
  • Clients provide POSIX file system interface

SLIDE 34

The XtreemOS Project

– Research project funded by the European Commission
– 19 partners from Europe and China
– XtreemFS is the data management component
  • developed by ZIB, NEC HPC Europe, Barcelona Supercomputing Center and ICAR-CNR Italy
  • ~3 years of development
  • first public release in August 2008

SLIDE 35

XtreemFS: Overview

– What is XtreemFS?
  • a distributed and replicated POSIX-compliant file system
  • off-the-shelf servers – no expensive hardware
  • servers in Java, runs on Linux / OS X / Solaris
  • client in C, runs on Linux / OS X / Windows
  • secure (X.509 and SSL)
  • easy to install and maintain
  • open source (GPL)

SLIDE 36

File System Landscape

[Diagram: file system landscape by scope, from PC through network FS / centralized and cluster FS / data center to Internet. Systems shown: ext3, ZFS, NTFS; NFS, SMB; AFS/Coda; Lustre, Panasas, GPFS, CEPH...; GDM, "gridftp", Grid File System, GFarm.]