File and Metadata Replication in XtreemFS
Björn Kolbeck, Zuse Institute Berlin
Why Replicate?
– fault tolerance
  – mail server
  – source repository
– bandwidth
  – start 1,000 VMs in parallel
  – grid workflows
– latency
  – local repositories (climate data, telescope images)
  – HSM: fast (disk) vs. slow (tape) replicas
The CAP Theorem
– Consistency
– Availability
– Partition tolerance
– "dernier cri" (the latest fashion): A+P (eventual consistency)
Brewer, Eric: Towards Robust Distributed Systems. PODC Keynote, 2000.
CAP: Examples
– C+A: single server, Linux HA (one data center)
– A+P: Amazon S3, Mercurial, Coda/AFS
– C+P: distributed databases and file systems
File System: Expected Semantics
– visibility/ordering: App A creates foo.txt and gets "ok"; App A sends a message to App B; when App B then opens foo.txt, the file system must also answer "ok", i.e. the file is already visible
– atomic create: App A and App B both create index.txt; exactly one gets "ok", the other gets EEXISTS (see the sketch below)
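A minimal sketch (not from the slides) of the atomic-create case in Java: whichever process reaches the file system first gets "ok", the other sees the POSIX EEXIST error, which java.nio surfaces as FileAlreadyExistsException. The path is illustrative.

```java
import java.io.IOException;
import java.nio.file.*;

public class AtomicCreate {
    public static void main(String[] args) {
        // Illustrative path on an XtreemFS mount; any path behaves the same way.
        Path index = Paths.get("/mnt/xtreemfs/index.txt");
        try {
            Files.createFile(index);   // atomic: succeeds only if the file did not exist yet
            System.out.println("ok - created index.txt");
        } catch (FileAlreadyExistsException e) {
            // The POSIX EEXIST case: another application won the race.
            System.out.println("EEXIST - another application created index.txt first");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

A consistently replicated (C+P) file system has to give the same guarantee even when the two applications talk to different replicas.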
File System: Consistency
– linearizability (metadata and file data)
  – communication between applications / users
– atomic operations (metadata only)
  – unique file names (create, rename)
  – used by real-world applications, e.g. dovecot
– expensive
File System: Do we really need consistency?
– A+P = conflicts
  – name clashes
  – multiple versions
– A+P vs. POSIX API
  – can't resolve name clashes
  – no support for multiple versions
  – no interface to resolve conflicts
– A+P vs. expectations
  – developers assume consistency
  – synchronization
XtreemFS
– distributed file system
– object-based design
– "POSIX semantics"
– focus on replication (grid, cloud)
Two problems – one solution
1. Metadata replication
  – problem: bottleneck
  – replication algorithms
  – "relax" requirements
  – our solution
2. File data replication
  – problem: scale
  – our solution
  – central lock service
3. Other file systems
Metadata: How to replicate?
Replicated State Machine (C+P), e.g. Paxos
+ no primary/master
+ no SPOF
+ no extra latency on failure
— slow: two round trips
— needs distributed transactions
— difficult to implement
Metadata: How to replicate?
Primary/Backup (C+P), e.g. replicated databases
+ fast: write = 1 RT, read = local (see the sketch below)
+ no distributed transactions
+ easy to implement
— primary failover: short interruption
— primary = bottleneck
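A minimal sketch, not XtreemFS code, of the primary/backup write path summarized above: the primary applies the update locally, forwards it to all backups in parallel and acknowledges after one round trip, while reads are answered from local state. All class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative primary replica: writes cost one round trip to the backups,
// reads are answered from the primary's local state.
class PrimaryReplica {
    interface BackupClient { void applyUpdate(String key, String value); }

    private final Map<String, String> localState = new ConcurrentHashMap<>();
    private final List<BackupClient> backups;            // RPC stubs for the backup replicas
    private final ExecutorService pool = Executors.newCachedThreadPool();

    PrimaryReplica(List<BackupClient> backups) { this.backups = backups; }

    // Write path: apply locally, forward to all backups in parallel,
    // acknowledge once every backup has applied the update (one round trip).
    void write(String key, String value) throws Exception {
        localState.put(key, value);
        List<Future<?>> acks = new ArrayList<>();
        for (BackupClient backup : backups) {
            acks.add(pool.submit(() -> backup.applyUpdate(key, value)));
        }
        for (Future<?> ack : acks) {
            ack.get();                                   // wait for the backup's ack
        }
    }

    // Read path: served from local state, no remote round trip.
    String read(String key) { return localState.get(key); }
}
```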
1. Metadata: How to replicate?
Linux HA (C+A)
– heartbeat signal + STONITH, shared storage
– Lustre failover
+ can be added "on top"
— still SPOFs: STONITH, ...
— only for clusters
— passive backups
1. Metadata: "relax"
– reads at all replicas = sequential consistency
  – stat, getattr, readdir (50-80% of all calls)
  – load balancing
  – upper bound on "staleness"
– write updates asynchronously (see the sketch below)
  – ack after local write
  – max. window of data loss
  – similar to asynchronous commit in PostgreSQL
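A sketch of the "ack after local write" relaxation, under the assumption that updates are shipped to the backups by a periodic background task; the flush interval then bounds the window of updates that can be lost if the primary crashes. Interval and names are illustrative.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of "ack after local write": an update is acknowledged as soon as it is
// applied locally and queued; a background task ships the queue to the backups
// every FLUSH_INTERVAL_MS, which bounds the window of lost updates.
class AsyncReplicator {
    private static final long FLUSH_INTERVAL_MS = 50;   // illustrative max. data-loss window
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    AsyncReplicator() {
        scheduler.scheduleAtFixedRate(this::shipToBackups,
                FLUSH_INTERVAL_MS, FLUSH_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    /** Applies the update locally and returns immediately; the client gets its ack here. */
    void write(String update) {
        // ... apply the update to the local metadata index ...
        pending.add(update);
    }

    private void shipToBackups() {
        String update;
        while ((update = pending.poll()) != null) {
            // ... forward the update to all backup replicas ...
        }
    }
}
```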
1. Metadata: Implementation in XtreemFS
– map metadata onto a flat index (see the sketch after this list)
– replicate the index with primary/backup
  – use leases to elect the primary
  – replicate insert/update/delete
– future work:
  – weaker consistency for some ops, e.g. chmod, file size updates
  – upper bound on "staleness"
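An illustrative sketch, not the actual XtreemFS/BabuDB schema, of mapping file-system metadata onto a flat index: entries are keyed by (parent directory ID, name), so create and rename become single-key inserts and deletes that can be shipped to backup replicas as plain insert/update/delete records.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: metadata mapped onto a flat, sorted key-value index. Keys are
// "parentId/name", so all entries of a directory are adjacent (readdir is a
// range scan) and create/rename/delete become single-key operations.
class FlatMetadataIndex {
    static final class Inode {
        final long id;
        final boolean isDirectory;
        Inode(long id, boolean isDirectory) { this.id = id; this.isDirectory = isDirectory; }
    }

    private final Map<String, Inode> index = new TreeMap<>();
    private long nextId = 2;                                   // inode 1 is the root directory

    private static String key(long parentId, String name) { return parentId + "/" + name; }

    Inode create(long parentId, String name, boolean isDirectory) {
        String k = key(parentId, name);
        if (index.containsKey(k)) throw new IllegalStateException("EEXIST");
        Inode inode = new Inode(nextId++, isDirectory);
        index.put(k, inode);                                   // this insert is what gets replicated
        return inode;
    }

    Inode lookup(long parentId, String name) { return index.get(key(parentId, name)); }

    void rename(long oldParentId, String oldName, long newParentId, String newName) {
        Inode inode = index.remove(key(oldParentId, oldName));
        if (inode == null) throw new IllegalStateException("ENOENT");
        index.put(key(newParentId, newName), inode);           // delete + insert, applied as one unit
    }
}
```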
1. Metadata: Excursion — Flat index vs. Tree
– database backend (BabuDB, LSM-tree based) vs. ext4 (empty files) ➔ competitive performance
– Linux kernel build: BabuDB 1799 s, ext4 1904 s
– IMAP trace (dovecot imapstress): BabuDB 357 s, ext4 385 s
2. File Data: Expected Semantics
– same as metadata
  – but no atomic operations
– many applications require less
  – read-only files / write-once
  – single process reading/writing
  – explicit fsync
2. File Data: Implementation in XtreemFS
– write-once: separate mechanism
  – more efficient
  – support for partial replicas
  – large number of replicas
– read-write: primary/backup
  – use leases for primary failover (see the sketch below)
  – requires a service for lease coordination, e.g. a lock service
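A sketch, not the XtreemFS protocol, of how a replica uses a lease to decide whether it may act as primary for a file: it serves writes only while its lease, reduced by a safety margin for clock drift, is still valid; after expiry it must re-acquire the lease (e.g. via Flease) before accepting writes again. Timeout values are illustrative.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of lease-guarded primary behaviour for a single file replica.
// The drift margin and error handling are illustrative.
class FileReplica {
    private static final Duration DRIFT_MARGIN = Duration.ofMillis(500);
    private volatile Instant leaseValidUntil = Instant.EPOCH;  // no lease held yet

    /** Called after this replica wins the lease election (e.g. via Flease). */
    void leaseGranted(Instant validUntil) { this.leaseValidUntil = validUntil; }

    boolean isPrimary() {
        // Only act as primary while the lease, shortened by a safety margin
        // for clock drift, has not yet expired.
        return Instant.now().isBefore(leaseValidUntil.minus(DRIFT_MARGIN));
    }

    void write(byte[] data) {
        if (!isPrimary()) {
            throw new IllegalStateException("not primary: lease expired, re-acquire it first");
        }
        // ... apply the write locally and forward it to the backup replicas ...
    }
}
```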
2. File Data: Problem of Scale
– large number of storage servers
– large number of files
– primary per open file?
– primary per partition?
– long lease timeouts, e.g. 1 min?
2. File Data: How to coordinate many leases?
– Flease: decentralized lease coordination
  – no central lock service
  – coordinated among the storage servers holding a replica
– numbers:
  – Google's Chubby: ~640 ops/sec
  – ZooKeeper: ~7,000 ops/sec
  – Flease: ~5,000 ops/sec (3 nodes), ~50,000 ops/sec (30 nodes)
2. File Data: Max. number of open files/server
30 nodes, LAN; max. open files per server:
lease timeout    Flease    ZooKeeper
 1 s               1701        245
 2 s               3402        489
 5 s               8505       1223
10 s              17010       2445
15 s              25515       3668
30 s              51029       7336
60 s             102058      14672
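These numbers are consistent with a simple back-of-the-envelope estimate (an assumption, not stated on the slide): if every open file needs roughly one lease renewal per lease timeout, a server can keep about (total lease ops/sec ÷ number of servers) × timeout files open. The snippet below reproduces the order of magnitude from the throughput figures on the previous slide.

```java
// Back-of-the-envelope estimate (assumption): one lease renewal per open file
// and lease timeout, so maxOpenFiles ~= (leaseOpsPerSec / numServers) * timeout.
public class LeaseCapacity {
    public static void main(String[] args) {
        int numServers = 30;                  // slide: 30 nodes, LAN
        double fleaseOpsPerSec = 50_000;      // slide: Flease, ~50,000 ops/sec at 30 nodes
        double zookeeperOpsPerSec = 7_000;    // slide: ZooKeeper, ~7,000 ops/sec
        for (int timeoutSec : new int[] {1, 5, 10}) {
            System.out.printf("timeout %2d s: Flease ~%6.0f, ZooKeeper ~%5.0f open files/server%n",
                    timeoutSec,
                    fleaseOpsPerSec / numServers * timeoutSec,
                    zookeeperOpsPerSec / numServers * timeoutSec);
        }
    }
}
```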
Replication: Other File Systems
– Lustre: Linux HA (metadata and file data)
– Ceph: primary/backup + central cfg. service and monitoring
– HDFS: metadata –, file data write-once
– GlusterFS: metadata –, file data RAID 1
Replication: Lessons Learned
– event-based design → no message re-ordering
– separation of replication layers → simplified implementation and testing
– no free lunch: consistency across data centers is expensive
Thank You
– http://www.xtreemfs.org
– upcoming release 1.3 includes replication