File and Metadata Replication in XtreemFS · Björn Kolbeck, Zuse Institute Berlin · PowerPoint Presentation



SLIDE 1

File and Metadata Replication in XtreemFS
Björn Kolbeck, Zuse Institute Berlin

SLIDE 2

Why Replicate?

– fault tolerance
  • mail server
  • source repository
– bandwidth
  • start 1,000 VMs in parallel
  • grid workflows
– latency
  • local repositories (climate data, telescope images)
  • HSM: fast (disk) vs. slow (tape) replicas

SLIDE 3

The CAP Theorem

– Consistency
– Availability
– Partition tolerance

– "dernier cri": A+P (eventual consistency)

Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2000.

[Diagram: CAP triangle with corners A, C, P]

SLIDE 4

CAP: Examples

[Diagram: CAP triangle with corners A, C, P]

– C+A: single server, Linux HA (one data center)
– A+P: Amazon S3, Mercurial, Coda/AFS
– C+P: distributed databases and file systems

SLIDE 5

File System: Expected Semantics

[Sequence diagram 1] App A → FS: create foo.txt (ok); App A → App B: send message; App B → FS: open foo.txt (ok)

[Sequence diagram 2] App A → FS: create index.txt (ok); App B → FS: create index.txt → EEXIST

(an exclusive-create sketch follows below)
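As an illustration of the second sequence (my addition, not from the slides), the exclusive-create behaviour that applications rely on can be written as a short Java sketch; the file name and the use of java.nio here are illustrative only:

    import java.nio.file.*;

    public class ExclusiveCreate {
        public static void main(String[] args) throws Exception {
            Path index = Paths.get("index.txt");   // hypothetical file name from the example
            Files.createFile(index);               // App A: succeeds, the file did not exist
            try {
                Files.createFile(index);           // App B: same exclusive create
            } catch (FileAlreadyExistsException e) {
                // the file system must report the conflict (EEXIST), even with replicated metadata
                System.out.println("EEXIST: " + e.getFile());
            }
        }
    }
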
SLIDE 6

File System: Consistency

– linearizability (metadata and file data)
  • communication between applications / users
– atomic operations (metadata only)
  • unique file names (create, rename)
  • used by real-world applications, e.g. Dovecot
– expensive
SLIDE 7

File System: Do we really need consistency?

– A+P = conflicts
  • name clashes
  • multiple versions
– A+P vs. POSIX API
  • can't resolve name clashes
  • no support for multiple versions
  • no interface to resolve conflicts
– A+P vs. expectations
  • developers assume consistency
  • synchronization

SLIDE 8

XtreemFS

– distributed file system
– object-based design
– "POSIX semantics"
– focus on replication (grid, cloud)

SLIDE 9

Two problems – one solution

1. Metadata replication
  • problem: bottleneck
  • replication algorithms
  • "relax" requirements
  • our solution
2. File data replication
  • problem: scale
  • our solution
  • central lock service
3. Other file systems

SLIDE 10

Metadata: How to replicate?

Replicated State Machine (C+P), e.g. Paxos

+ no primary/master
+ no SPOF
+ no extra latency on failure
– slow (two round trips)
– needs distributed transactions
– difficult to implement

(see the consensus sketch below)
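A minimal sketch (not XtreemFS code) of why a replicated state machine pays the cost listed above: every state-changing operation must first pass a consensus round involving a quorum of replicas (at least two message round trips in Paxos) before it can be applied. The Consensus and StateMachine interfaces below are assumptions for illustration:

    // Sketch only: the consensus round is the source of the extra round trips.
    interface Consensus {
        // blocks until a quorum of replicas has accepted the operation at some log slot
        long agree(byte[] operation);
    }

    interface StateMachine {
        byte[] apply(byte[] operation);   // deterministic, applied in the same order on every replica
    }

    final class ReplicatedMetadata {
        private final Consensus consensus;
        private final StateMachine metadata;

        ReplicatedMetadata(Consensus consensus, StateMachine metadata) {
            this.consensus = consensus;
            this.metadata = metadata;
        }

        byte[] execute(byte[] operation) {
            consensus.agree(operation);      // quorum round trips before anything is applied
            return metadata.apply(operation);
        }
    }
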

SLIDE 11

Metadata: How to replicate?

Primary/Backup (C+P), as in replicated databases

+ fast (write = 1 round trip, read = local)
+ no distributed transactions
+ easy to implement
– primary failover (short interruption)
– primary = bottleneck

(see the primary/backup sketch below)
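A hedged sketch of the primary/backup write and read paths described above (illustrative, not the XtreemFS implementation); the Backup interface stands in for an RPC stub:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // A write costs one round trip from the primary to its backups;
    // a read is served from the primary's local state.
    final class PrimaryReplica {
        interface Backup { void applyUpdate(String key, String value); }   // assumed RPC stub

        private final Map<String, String> store = new ConcurrentHashMap<>();
        private final List<Backup> backups;

        PrimaryReplica(List<Backup> backups) { this.backups = backups; }

        void write(String key, String value) {
            store.put(key, value);              // apply locally
            for (Backup b : backups) {
                b.applyUpdate(key, value);      // one round trip per backup (in parallel in practice)
            }
            // acknowledge to the client only after the backups confirmed the update
        }

        String read(String key) {
            return store.get(key);              // local read at the primary
        }
    }
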

SLIDE 12

1. Metadata: How to replicate?

Linux HA (C+A): heartbeat signal + STONITH, shared storage (e.g. Lustre failover)

+ can be added "on top"
– still SPOFs: STONITH...
– only for clusters
– passive backups

SLIDE 13

1. Metadata: "relax"

– reads at all replicas = sequential consistency
  • stat, getattr, readdir (50-80% of all calls)
  • load balancing
  • upper bound on "staleness"
– write updates asynchronously
  • ack after local write
  • max. window of data loss
  • similar to the sync settings in PostgreSQL

(see the asynchronous-replica sketch below)
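The two relaxations can be sketched as follows (my illustration, not XtreemFS code); the 5-second staleness bound is an assumed value, not one from the slides:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // A replica answers read-only calls locally as long as it is not "too stale",
    // and the primary acknowledges writes after the local update, shipping them
    // to the backups asynchronously (bounded window of possible data loss).
    final class RelaxedReplica {
        private final Map<String, String> store = new ConcurrentHashMap<>();
        private final ExecutorService shipper = Executors.newSingleThreadExecutor();
        private volatile long lastSyncMillis = System.currentTimeMillis();
        private final long maxStalenessMillis = 5_000;   // assumed bound, not from the slides

        String readLocal(String key) {
            if (System.currentTimeMillis() - lastSyncMillis > maxStalenessMillis) {
                throw new IllegalStateException("replica too stale, redirect to primary");
            }
            return store.get(key);                       // stat/getattr/readdir-style read
        }

        void writeAsync(String key, String value, Runnable shipToBackups) {
            store.put(key, value);                       // ack to the client after the local write
            shipper.submit(shipToBackups);               // backups are updated asynchronously
        }

        void onSyncWithPrimary() { lastSyncMillis = System.currentTimeMillis(); }
    }
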

SLIDE 14

1. Metadata: Implementation in XtreemFS

– map metadata onto a flat index
– replicate the index with primary/backup
  • use leases to elect the primary
  • replicate insert/update/delete
– future work:
  • weaker consistency for some ops, e.g. chmod, file size updates
  • upper bound on "staleness"

(a flat-index sketch follows below)
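A minimal sketch of mapping a directory tree onto a flat, ordered key-value index, assuming keys of the form parentId + "/" + name; this key layout is an assumption for illustration, not the actual XtreemFS/BabuDB layout. The point is that create/rename/delete become insert/update/delete on the flat index, which is what gets replicated with primary/backup:

    import java.util.Map;
    import java.util.TreeMap;

    final class FlatMetadataIndex {
        private final TreeMap<String, String> index = new TreeMap<>();   // ordered: readdir = range scan
        private long nextFileId = 2;                                     // 1 reserved for the root dir

        private static String key(long parentId, String name) { return parentId + "/" + name; }

        long create(long parentId, String name) {
            String k = key(parentId, name);
            if (index.containsKey(k)) throw new IllegalStateException("EEXIST");
            long id = nextFileId++;
            index.put(k, "fileId=" + id);                // the value would hold the stat data
            return id;
        }

        void delete(long parentId, String name) { index.remove(key(parentId, name)); }

        // readdir: all keys with the parent's prefix, i.e. one range scan over the flat index
        Iterable<Map.Entry<String, String>> readdir(long parentId) {
            return index.subMap(parentId + "/", parentId + "/\uffff").entrySet();
        }
    }
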

SLIDE 15

1. Metadata: Excursion — Flat index vs. Tree

– database backend (BabuDB, LSM-tree based) vs. ext4 (empty files) ➔ competitive performance

[Charts: duration in seconds, lower is better]
  • Linux kernel build: BabuDB 1799 s, ext4 1904 s
  • IMAP trace (Dovecot imapstress): BabuDB 357 s, ext4 385 s

SLIDE 16

2. File Data: Expected Semantics

– same as metadata, but no atomic operations
– many applications require less:
  • read-only files / write-once
  • single process reading/writing
  • explicit fsync

SLIDE 17

2. File Data: Implementation in XtreemFS

– write-once: separate mechanism
  • more efficient
  • support for partial replicas
  • large number of replicas
– read-write: primary/backup
  • use leases for primary failover
  • requires a service for lease coordination, e.g. a lock service

(see the lease-check sketch below)
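A small sketch of the lease check on the read-write path (illustrative only, not XtreemFS code): a storage server acts as the primary for a file's replica group only while its lease is valid, which is what makes failover to another replica possible after the lease timeout:

    // The lease is granted by the coordination service (see the next slides).
    final class FileReplica {
        private volatile long leaseValidUntilMillis = 0;

        void onLeaseGranted(long validUntilMillis) { leaseValidUntilMillis = validUntilMillis; }

        boolean isPrimary() { return System.currentTimeMillis() < leaseValidUntilMillis; }

        void write(byte[] objectData) {
            if (!isPrimary()) {
                throw new IllegalStateException("not primary, client must retry at the lease holder");
            }
            // apply locally, then forward the update to the backup replicas (as on the metadata path)
        }
    }
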

SLIDE 18

2. File Data: Problem of Scale

– large number of storage servers
– large number of files
– primary per open file?
– primary per partition?
– long lease timeouts, e.g. 1 min?

SLIDE 19

2. File Data: How to coordinate many leases?

– Flease: decentralized lease coordination
  • no central lock service
  • coordinated among the storage servers holding a replica
– numbers:
  • Google's Chubby: ~640 ops/sec
  • ZooKeeper: ~7,000 ops/sec
  • Flease: ~5,000 ops/sec (3 nodes), ~50,000 ops/sec (30 nodes)

(a lease-acquisition sketch follows below)
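A hedged sketch of what decentralized lease coordination looks like from one storage server's point of view. This is not the Flease algorithm itself (Flease additionally handles concurrent candidates and clock drift with Paxos-style rounds); it only shows that the lease is negotiated among the servers holding the file's replicas, with no central lock service. The ReplicaPeer interface is an assumed RPC stub:

    import java.time.Instant;
    import java.util.List;

    final class LeaseCandidate {
        interface ReplicaPeer {
            // returns true if the peer grants the lease until 'until' (assumed RPC)
            boolean requestLease(String fileId, String candidateId, Instant until);
        }

        private final String myId;
        private final List<ReplicaPeer> peers;   // the other servers holding this file's replicas

        LeaseCandidate(String myId, List<ReplicaPeer> peers) {
            this.myId = myId;
            this.peers = peers;
        }

        boolean tryAcquire(String fileId, long leaseTimeoutMillis) {
            Instant until = Instant.now().plusMillis(leaseTimeoutMillis);
            int grants = 1;                                  // our own vote
            for (ReplicaPeer p : peers) {
                if (p.requestLease(fileId, myId, until)) grants++;
            }
            // primary for this file until 'until' only with a majority of grants
            return grants > (peers.size() + 1) / 2;
        }
    }
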

SLIDE 20

2. File Data: Max. number of open files/server

[Chart: max. open files per server vs. lease timeout, Flease vs. ZooKeeper, 30 nodes, LAN; at a 60 s timeout the chart reaches ~102,000 open files/server for Flease vs. ~14,700 for ZooKeeper]

  lease timeout | Flease  | ZooKeeper
  1 sec         |  1,700  |    245
  5 sec         |  8,500  |  1,223
  10 sec        | 17,010  |  2,445

(a capacity estimate follows below)
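These values are consistent with a simple capacity estimate (my note, not from the slides): each open file needs one lease renewal per lease period, so a server can sustain at most (lease operations per second per server) × (lease timeout) open files:

    max open files/server ≈ (aggregate lease ops/sec ÷ number of servers) × lease timeout
    Flease, 30 nodes:     (50,000 ÷ 30) × 1 s ≈ 1,700    and × 10 s ≈ 17,000
    ZooKeeper, 30 nodes:   (7,000 ÷ 30) × 1 s ≈ 230      and × 10 s ≈ 2,300  (table: 245 / 2,445)
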

SLIDE 21

Replication: Other File Systems

[Table (layout partly lost): metadata vs. file data replication in other systems. Recoverable entries: Lustre – Linux HA for metadata and file data (cf. the Linux HA slide); Ceph, HDFS, GlusterFS – RAID-1-style replicas, primary/backup + central cfg. service and monitoring, write-once; the exact cell assignment for these three is not recoverable from the extracted text.]

SLIDE 22

Replication: Lessons Learned

– event-based design → no message re-ordering
– separation of replication layers → simplified implementation and testing
– no free lunch: consistency across data centers is expensive

SLIDE 23

Thank You

– http://www.xtreemfs.org
– upcoming release 1.3 includes replication

XtreemFS is developed within the XtreemOS project. XtreemOS is funded by the European Commission under contract #FP6-033576.