File and Metadata Replication in XtreemFS · Björn Kolbeck


  1. File and Metadata Replication in XtreemFS · Björn Kolbeck, Zuse Institute Berlin

  2. Why Replicate?
     – fault tolerance: mail server, source repository
     – bandwidth: start 1,000 VMs in parallel, grid workflows
     – latency: local repositories (climate data, telescope images), HSM: fast (disk) vs. slow (tape) replicas

  3. The CAP Theorem
     – Consistency
     – Availability
     – Partition tolerance
     – "dernier cri": A+P (eventual consistency)
     Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2004.

  4. CAP: Examples
     – C+A: single server, Linux HA (one data center)
     – A+P: Amazon S3, Mercurial, Coda/AFS
     – C+P: distributed databases and file systems

  5. File System: Expected Semantics
     – [diagram] App A creates foo.txt (FS: ok) and sends a message to App B; App B can then open foo.txt
     – [diagram] App A creates index.txt (FS: ok); App B's concurrent create of index.txt fails with EEXISTS
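To make these expectations concrete, here is a minimal, runnable sketch (my illustration, not from the slides) of the second scenario against a local POSIX file system: once a create has been acknowledged, every other process sees the file, and a second exclusive create of the same name fails with EEXIST.

```python
import errno
import os
import tempfile

dirpath = tempfile.mkdtemp()
path = os.path.join(dirpath, "index.txt")

# "App A": create the file; the OS acknowledges with a valid descriptor.
fd_a = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
os.close(fd_a)

# "App B": after A's create was acknowledged, B can open the file ...
fd_b = os.open(path, os.O_RDONLY)
os.close(fd_b)

# ... but a second exclusive create of the same name must fail with EEXIST.
try:
    os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
except OSError as e:
    assert e.errno == errno.EEXIST
    print("second create failed with EEXIST, as POSIX applications expect")
```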

  6. File System: Consistency
     – linearizability (metadata and file data): communication between applications / users
     – atomic operations (metadata only): unique file names (create, rename)
     – used by real-world applications, e.g. dovecot
     – expensive

  7. File System: Do we really need consistency?
     – A+P = conflicts: name clashes, multiple versions
     – A+P vs. POSIX API: can't resolve name clashes, no support for multiple versions, no interface to resolve conflicts
     – A+P vs. expectations: developers assume consistency, synchronization

  8. XtreemFS
     – distributed file system
     – "POSIX semantics"
     – object-based design
     – focus on replication (grid, cloud)

  9. Two problems – one solution
     1. Metadata replication: problem (bottleneck), replication algorithms, "relax" requirements, our solution
     2. File data replication: problem (scale), our solution, central lock service
     3. Other file systems

  10. Metadata: How to replicate? Replicated State Machine (C+P) – Paxos
      + no primary/master, no SPOF, no extra latency on failure
      – slow (two round trips), needs distr. transactions, difficult to implement
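A replicated state machine keeps replicas identical by applying every operation, in one agreed total order, on every replica; a consensus protocol such as Paxos is what agrees on that order. The sketch below is my illustration of just the apply-in-order part, with the consensus step replaced by a plain in-process list, so it shows the structure rather than the protocol.

```python
from typing import Any, Dict, List, Tuple

class MetadataReplica:
    """One replica of the metadata state machine: a dict of path -> attributes."""

    def __init__(self) -> None:
        self.files: Dict[str, Dict[str, Any]] = {}

    def apply(self, op: Tuple[str, str, Dict[str, Any]]) -> None:
        # Operations must be deterministic so every replica ends in the same state.
        kind, path, attrs = op
        if kind == "create":
            self.files.setdefault(path, attrs)
        elif kind == "delete":
            self.files.pop(path, None)

# In a real RSM, Paxos (or another consensus protocol) decides the order of this
# log; here a plain Python list stands in for the agreed order.
agreed_log: List[Tuple[str, str, Dict[str, Any]]] = [
    ("create", "/volume/foo.txt", {"mode": 0o644, "size": 0}),
    ("create", "/volume/bar.txt", {"mode": 0o600, "size": 0}),
    ("delete", "/volume/foo.txt", {}),
]

replicas = [MetadataReplica() for _ in range(3)]
for op in agreed_log:
    for r in replicas:          # every replica applies every operation, in order
        r.apply(op)

assert all(r.files == replicas[0].files for r in replicas)
print(replicas[0].files)        # identical state on all replicas
```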

  11. Metadata: How to replicate? Primary/Backup (C+P) – replicated databases
      + fast (write = 1 RT, read = local), no distr. transactions, easy to implement
      – primary failover (short interruption), primary = bottleneck
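A minimal sketch of the primary/backup scheme, assuming synchronous forwarding to the backups: the client talks only to the primary (one round trip), reads are answered from the primary's local state, and the backups stay current so one of them can take over after a failover. The class names and the in-memory store are illustrative, not XtreemFS code.

```python
from typing import Dict, List

class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.store[key] = value

class Primary(Replica):
    def __init__(self, name: str, backups: List[Replica]) -> None:
        super().__init__(name)
        self.backups = backups

    def write(self, key: str, value: str) -> str:
        # One round trip from the client's point of view: the client talks only
        # to the primary, which forwards the update to its backups.
        self.apply(key, value)
        for b in self.backups:
            b.apply(key, value)   # synchronous forwarding keeps backups current
        return "ok"

    def read(self, key: str) -> str:
        # Reads are served locally by the primary, no coordination needed.
        return self.store[key]

backups = [Replica("backup-1"), Replica("backup-2")]
primary = Primary("primary", backups)
primary.write("/volume/foo.txt", "mode=0644")
print(primary.read("/volume/foo.txt"))
print(backups[0].store)   # backups hold the same metadata and can take over on failover
```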

  12. 1. Metadata: How to replicate? Linux HA (C+A)
      – heartbeat signal + STONITH, shared storage (Lustre failover)
      + can be added "on top"
      – still SPOFs (STONITH...), only for clusters, passive backups

  13. 1. Metadata: "relax"
      – reads from all replicas = sequential consistency: stat, getattr, readdir (50–80% of all calls), load balancing, upper bound on "staleness"
      – write updates asynchronously: ack after local write, max. window of data loss, similar to sync in PostgreSQL
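The sketch below illustrates both relaxations under stated assumptions: the backup serves read-only calls locally as long as it has heard from the primary within a configurable staleness bound, and the primary acknowledges writes after its local update and ships them to the backup asynchronously, so a crash loses at most the queued window. The heartbeat mechanism is my illustration of how a staleness bound could be enforced, not the XtreemFS implementation.

```python
import queue
import threading
import time
from typing import Dict, Tuple

class BackupReplica:
    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.last_contact = time.monotonic()  # last update or heartbeat from the primary

    def apply(self, key: str, value: str) -> None:
        self.store[key] = value
        self.last_contact = time.monotonic()

    def heartbeat(self) -> None:
        self.last_contact = time.monotonic()

    def read(self, key: str, max_staleness: float) -> str:
        # Serve stat/getattr/readdir-style reads locally, but only while the
        # replica has heard from the primary recently enough to bound staleness.
        if time.monotonic() - self.last_contact > max_staleness:
            raise RuntimeError("replica may be too stale, redirect to the primary")
        return self.store[key]

class AsyncPrimary:
    def __init__(self, backup: BackupReplica) -> None:
        self.store: Dict[str, str] = {}
        self.outbox: "queue.Queue[Tuple[str, str]]" = queue.Queue()
        threading.Thread(target=self._ship, args=(backup,), daemon=True).start()

    def write(self, key: str, value: str) -> str:
        self.store[key] = value          # acknowledge after the local write only;
        self.outbox.put((key, value))    # updates reach the backup asynchronously,
        return "ok"                      # so a crash loses at most the queued window

    def _ship(self, backup: BackupReplica) -> None:
        while True:
            try:
                key, value = self.outbox.get(timeout=1.0)
                backup.apply(key, value)
            except queue.Empty:
                backup.heartbeat()       # idle: tell the backup it is still current

backup = BackupReplica()
primary = AsyncPrimary(backup)
primary.write("/volume/foo.txt", "size=42")
time.sleep(0.1)                          # give the shipper thread a moment
print(backup.read("/volume/foo.txt", max_staleness=5.0))
```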

  14. 1. Metadata: Implementation in XtreemFS
      – map metadata onto a flat index
      – replicate the index with primary/backup: use leases to elect the primary, replicate insert/update/delete
      – future work: weaker consistency for some ops (e.g. chmod, file size updates), upper bound on "staleness"
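A sketch of what "map metadata onto a flat index" can look like; the key layout (parent directory id, entry name) is an assumption for illustration. The point is that the directory tree becomes one flat, sorted table, which an LSM-tree backend such as BabuDB can store and which primary/backup replication can ship as plain insert/update/delete records.

```python
from typing import Dict, List, Tuple

# Flat index: one sorted mapping from (parent directory id, entry name) to
# attributes, instead of an in-memory directory tree.
index: Dict[Tuple[int, str], Dict] = {}
next_id = 1

def create(parent_id: int, name: str, **attrs) -> int:
    global next_id
    key = (parent_id, name)
    if key in index:
        raise FileExistsError(name)   # uniqueness of names is a simple key check
    next_id += 1
    index[key] = {"id": next_id, **attrs}
    return next_id

def readdir(parent_id: int) -> List[str]:
    # With a sorted index this is a prefix scan over keys starting with parent_id.
    return sorted(name for (pid, name) in index if pid == parent_id)

root = 1
etc = create(root, "etc", mode=0o755)
create(etc, "fstab", mode=0o644, size=512)
create(root, "home", mode=0o755)
print(readdir(root))   # ['etc', 'home']
print(readdir(etc))    # ['fstab']
```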

  15. 1. Metadata: Excursion – flat index vs. tree
      – database backend (BabuDB, LSM-tree based) vs. ext4 (empty files)
      – [chart] duration (s): Linux kernel build (~357–385 s on both) and IMAP trace (dovecot imapstress, ~1,799–1,904 s on both)
      ➔ competitive performance

  16. 2. File Data: Expected Semantics
      – same as metadata, but no atomic operations
      – many applications require less: read-only files / write-once, single process reading/writing, explicit fsync

  17. 2. File Data: Implementation in XtreemFS
      – write-once: separate mechanism (more efficient), support for partial replicas, large number of replicas
      – read-write: primary/backup, uses leases for primary failover, requires a service for lease coordination, e.g. a lock service
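For the read-write case, a sketch of how a lease can gate writes at a storage server: only the current lease holder acts as primary, and once the lease expires another replica can be elected. The lease timeout and class names are illustrative assumptions; in XtreemFS the lease itself comes out of the coordination step discussed on the next slides.

```python
import time
from typing import Optional

LEASE_TIMEOUT = 15.0   # seconds; illustrative value, not the XtreemFS default

class Lease:
    def __init__(self, holder: str, acquired_at: float) -> None:
        self.holder = holder
        self.expires_at = acquired_at + LEASE_TIMEOUT

    def valid_for(self, server: str, now: float) -> bool:
        return self.holder == server and now < self.expires_at

class FileReplica:
    """One storage server's replica of a read-write file."""

    def __init__(self, server: str) -> None:
        self.server = server
        self.data = bytearray()
        self.lease: Optional[Lease] = None

    def grant_lease(self, lease: Lease) -> None:
        # The lease would come from a coordination step among the replicas
        # (see the Flease slides); here it is handed in directly.
        self.lease = lease

    def write(self, offset: int, payload: bytes) -> str:
        now = time.monotonic()
        if self.lease is None or not self.lease.valid_for(self.server, now):
            return "redirect: not primary, acquire or renew the lease first"
        self.data[offset:offset + len(payload)] = payload
        # a real primary would now forward the write to the backup replicas
        return "ok"

osd1 = FileReplica("osd-1")
print(osd1.write(0, b"hello"))                   # rejected, no lease yet
osd1.grant_lease(Lease("osd-1", time.monotonic()))
print(osd1.write(0, b"hello"))                   # ok: osd-1 is primary until the lease expires
```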

  18. 2. File Data: Problem of Scale
      – large number of storage servers
      – large number of files
      – primary per open file?
      – primary per partition?
      – long lease timeouts, e.g. 1 min?

  19. 2. File Data: How to coordinate many leases?
      – Flease: decentralized lease coordination, no central lock service, coordinated among the storage servers holding a replica
      – numbers: Google's Chubby ~640 ops/sec, ZooKeeper ~7,000 ops/sec, Flease ~5,000 ops/sec (3 nodes), ~50,000 ops/sec (30 nodes)
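Flease itself is a decentralized, Paxos-style lease negotiation among the servers that hold a replica. The sketch below is a deliberately simplified majority-grant illustration of that idea (the replica holders vote, there is no central lock service); it omits the failure handling, message loss, and clock-skew treatment that the real algorithm provides, so read it as a sketch of the structure only.

```python
import time
from typing import Dict, Optional, Tuple

LEASE_TIMEOUT = 15.0  # seconds, illustrative

class ReplicaHolder:
    """A storage server holding one replica of the file and voting on its lease."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.granted: Optional[Tuple[str, float]] = None   # (holder, expiry)

    def request_lease(self, candidate: str, now: float) -> bool:
        # Grant only if there is no unexpired grant to a different candidate.
        if self.granted and self.granted[1] > now and self.granted[0] != candidate:
            return False
        self.granted = (candidate, now + LEASE_TIMEOUT)
        return True

def acquire_lease(candidate: str, holders: Dict[str, ReplicaHolder]) -> bool:
    """The candidate becomes primary if a majority of replica holders grant it."""
    now = time.monotonic()
    votes = sum(h.request_lease(candidate, now) for h in holders.values())
    return votes > len(holders) // 2

holders = {n: ReplicaHolder(n) for n in ("osd-1", "osd-2", "osd-3")}
print(acquire_lease("osd-1", holders))   # True: majority grants, osd-1 is primary
print(acquire_lease("osd-2", holders))   # False while osd-1's lease is unexpired
```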

  20. 2. File Data: Max. number of open files/server
      – [chart] max. open files per server vs. lease timeout (up to 60 s), Flease vs. ZooKeeper, 30 nodes, LAN; the Flease curve reaches ~102,000 open files at a 60 s lease timeout
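One way to read this chart, as an interpretation rather than something stated on the slide: each open file's lease has to be renewed once per lease timeout, so a server that sustains R lease operations per second can keep roughly R * T files open at a timeout of T seconds.

```python
# Back-of-the-envelope model for the chart above (my interpretation): every open
# file needs one lease renewal per timeout period.
def max_open_files(lease_ops_per_sec: float, lease_timeout_s: float) -> int:
    return int(lease_ops_per_sec * lease_timeout_s)

# ~1,700 lease ops/sec per server is an illustrative figure, in the ballpark of
# the 30-node Flease throughput from the previous slide spread across the servers.
print(max_open_files(1_700, 60))   # ~102,000, matching the top of the Flease curve
```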

  21. Replication: Other File Systems (metadata / file data)
      – Lustre: Linux HA / Linux HA
      – Ceph: primary/backup + central cfg. service and monitoring
      – GlusterFS: RAID 1 (file data)
      – HDFS: write-once (file data)

  22. Replication: Lessons Learned
      – event-based design → no message re-ordering
      – separation of replication layers → simplified implementation and testing
      – no free lunch: consistency across data centers is expensive

  23. Thank You
      – http://www.xtreemfs.org
      – upcoming release 1.3 includes replication
      XtreemFS is developed within the XtreemOS project. XtreemOS is funded by the European Commission under contract #FP6-033576.
