File and Metadata Replication in XtreemFS · Björn Kolbeck, Zuse Institute Berlin · PowerPoint Presentation



SLIDE 1

File and Metadata Replication in XtreemFS
Björn Kolbeck, Zuse Institute Berlin

SLIDE 2

Why Replicate?

– fault tolerance
  • mail server
  • source repository
– bandwidth
  • start 1,000 VMs in parallel
  • grid workflows
– latency
  • local repositories (climate data, telescope images)
  • HSM: fast (disk) vs. slow (tape) replicas

SLIDE 3

The CAP Theorem

– Consistency
– Availability
– Partition tolerance

– "dernier cri": A+P (eventual consistency)

Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2000.

[Diagram: CAP triangle with corners A, C, P]

SLIDE 4

CAP: Examples

[Diagram: CAP triangle with corners A, C, P]

– C+A: single server, Linux HA (one data center)
– A+P: Amazon S3, Mercurial, Coda/AFS
– C+P: distributed databases and file systems

SLIDE 5

File System: Expected Semantics

[Sequence diagram 1] App A → FS: create foo.txt (ok); App A → App B: send message; App B → FS: open foo.txt (ok)

[Sequence diagram 2] App A → FS: create index.txt (ok); App B → FS: create index.txt → EEXIST

(an exclusive-create sketch follows below)
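As an illustration of the second sequence (my addition, not from the slides), the exclusive-create behaviour that applications rely on can be written as a short Java sketch; the file name and the use of java.nio here are illustrative only:

    import java.nio.file.*;

    public class ExclusiveCreate {
        public static void main(String[] args) throws Exception {
            Path index = Paths.get("index.txt");   // hypothetical file name from the example
            Files.createFile(index);               // App A: succeeds, the file did not exist
            try {
                Files.createFile(index);           // App B: same exclusive create
            } catch (FileAlreadyExistsException e) {
                // the file system must report the conflict (EEXIST), even with replicated metadata
                System.out.println("EEXIST: " + e.getFile());
            }
        }
    }
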
SLIDE 6

File System: Consistency

– linearizability (metadata and file data)
  • communication between applications / users
– atomic operations (metadata only)
  • unique file names (create, rename)
  • used by real-world applications, e.g. Dovecot
– expensive
SLIDE 7

File System: Do we really need consistency?

– A+P = conflicts
  • name clashes
  • multiple versions
– A+P vs. POSIX API
  • can't resolve name clashes
  • no support for multiple versions
  • no interface to resolve conflicts
– A+P vs. expectations
  • developers assume consistency
  • synchronization

SLIDE 8

XtreemFS

– distributed file system
– object-based design
– "POSIX semantics"
– focus on replication (grid, cloud)

SLIDE 9

Two problems – one solution

1. Metadata replication
  • problem: bottleneck
  • replication algorithms
  • "relax" requirements
  • our solution
2. File data replication
  • problem: scale
  • our solution
  • central lock service
3. Other file systems

SLIDE 10

Metadata: How to replicate?

Replicated State Machine (C+P), e.g. Paxos

+ no primary/master
+ no SPOF
+ no extra latency on failure
– slow (two round trips)
– needs distributed transactions
– difficult to implement

(see the consensus sketch below)
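A minimal sketch (not XtreemFS code) of why a replicated state machine pays the cost listed above: every state-changing operation must first pass a consensus round involving a quorum of replicas (at least two message round trips in Paxos) before it can be applied. The Consensus and StateMachine interfaces below are assumptions for illustration:

    // Sketch only: the consensus round is the source of the extra round trips.
    interface Consensus {
        // blocks until a quorum of replicas has accepted the operation at some log slot
        long agree(byte[] operation);
    }

    interface StateMachine {
        byte[] apply(byte[] operation);   // deterministic, applied in the same order on every replica
    }

    final class ReplicatedMetadata {
        private final Consensus consensus;
        private final StateMachine metadata;

        ReplicatedMetadata(Consensus consensus, StateMachine metadata) {
            this.consensus = consensus;
            this.metadata = metadata;
        }

        byte[] execute(byte[] operation) {
            consensus.agree(operation);      // quorum round trips before anything is applied
            return metadata.apply(operation);
        }
    }
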

SLIDE 11

Metadata: How to replicate?

Primary/Backup (C+P), as in replicated databases

+ fast (write = 1 round trip, read = local)
+ no distributed transactions
+ easy to implement
– primary failover (short interruption)
– primary = bottleneck

(see the primary/backup sketch below)
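A hedged sketch of the primary/backup write and read paths described above (illustrative, not the XtreemFS implementation); the Backup interface stands in for an RPC stub:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // A write costs one round trip from the primary to its backups;
    // a read is served from the primary's local state.
    final class PrimaryReplica {
        interface Backup { void applyUpdate(String key, String value); }   // assumed RPC stub

        private final Map<String, String> store = new ConcurrentHashMap<>();
        private final List<Backup> backups;

        PrimaryReplica(List<Backup> backups) { this.backups = backups; }

        void write(String key, String value) {
            store.put(key, value);              // apply locally
            for (Backup b : backups) {
                b.applyUpdate(key, value);      // one round trip per backup (in parallel in practice)
            }
            // acknowledge to the client only after the backups confirmed the update
        }

        String read(String key) {
            return store.get(key);              // local read at the primary
        }
    }
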

SLIDE 12

1. Metadata: How to replicate?

Linux HA (C+A): heartbeat signal + STONITH, shared storage (e.g. Lustre failover)

+ can be added "on top"
– still SPOFs: STONITH...
– only for clusters
– passive backups

SLIDE 13

1. Metadata: "relax"

– reads at all replicas = sequential consistency
  • stat, getattr, readdir (50-80% of all calls)
  • load balancing
  • upper bound on "staleness"
– write updates asynchronously
  • ack after local write
  • max. window of data loss
  • similar to the sync settings in PostgreSQL

(see the asynchronous-replica sketch below)
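The two relaxations can be sketched as follows (my illustration, not XtreemFS code); the 5-second staleness bound is an assumed value, not one from the slides:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // A replica answers read-only calls locally as long as it is not "too stale",
    // and the primary acknowledges writes after the local update, shipping them
    // to the backups asynchronously (bounded window of possible data loss).
    final class RelaxedReplica {
        private final Map<String, String> store = new ConcurrentHashMap<>();
        private final ExecutorService shipper = Executors.newSingleThreadExecutor();
        private volatile long lastSyncMillis = System.currentTimeMillis();
        private final long maxStalenessMillis = 5_000;   // assumed bound, not from the slides

        String readLocal(String key) {
            if (System.currentTimeMillis() - lastSyncMillis > maxStalenessMillis) {
                throw new IllegalStateException("replica too stale, redirect to primary");
            }
            return store.get(key);                       // stat/getattr/readdir-style read
        }

        void writeAsync(String key, String value, Runnable shipToBackups) {
            store.put(key, value);                       // ack to the client after the local write
            shipper.submit(shipToBackups);               // backups are updated asynchronously
        }

        void onSyncWithPrimary() { lastSyncMillis = System.currentTimeMillis(); }
    }
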

SLIDE 14

1. Metadata: Implementation in XtreemFS

– map metadata onto a flat index
– replicate the index with primary/backup
  • use leases to elect the primary
  • replicate insert/update/delete
– future work:
  • weaker consistency for some ops, e.g. chmod, file size updates
  • upper bound on "staleness"

(a flat-index sketch follows below)
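A minimal sketch of mapping a directory tree onto a flat, ordered key-value index, assuming keys of the form parentId + "/" + name; this key layout is an assumption for illustration, not the actual XtreemFS/BabuDB layout. The point is that create/rename/delete become insert/update/delete on the flat index, which is what gets replicated with primary/backup:

    import java.util.Map;
    import java.util.TreeMap;

    final class FlatMetadataIndex {
        private final TreeMap<String, String> index = new TreeMap<>();   // ordered: readdir = range scan
        private long nextFileId = 2;                                     // 1 reserved for the root dir

        private static String key(long parentId, String name) { return parentId + "/" + name; }

        long create(long parentId, String name) {
            String k = key(parentId, name);
            if (index.containsKey(k)) throw new IllegalStateException("EEXIST");
            long id = nextFileId++;
            index.put(k, "fileId=" + id);                // the value would hold the stat data
            return id;
        }

        void delete(long parentId, String name) { index.remove(key(parentId, name)); }

        // readdir: all keys with the parent's prefix, i.e. one range scan over the flat index
        Iterable<Map.Entry<String, String>> readdir(long parentId) {
            return index.subMap(parentId + "/", parentId + "/\uffff").entrySet();
        }
    }
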

SLIDE 15

1. Metadata: Excursion — Flat index vs. Tree

– database backend (BabuDB, LSM-tree based) vs. ext4 (empty files) ➔ competitive performance

[Charts: duration in seconds, lower is better]
  • Linux kernel build: BabuDB 1799 s, ext4 1904 s
  • IMAP trace (Dovecot imapstress): BabuDB 357 s, ext4 385 s

SLIDE 16

2. File Data: Expected Semantics

– same as metadata, but no atomic operations
– many applications require less:
  • read-only files / write-once
  • single process reading/writing
  • explicit fsync

SLIDE 17

2. File Data: Implementation in XtreemFS

– write-once: separate mechanism
  • more efficient
  • support for partial replicas
  • large number of replicas
– read-write: primary/backup
  • use leases for primary failover
  • requires a service for lease coordination, e.g. a lock service

(see the lease-check sketch below)
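A small sketch of the lease check on the read-write path (illustrative only, not XtreemFS code): a storage server acts as the primary for a file's replica group only while its lease is valid, which is what makes failover to another replica possible after the lease timeout:

    // The lease is granted by the coordination service (see the next slides).
    final class FileReplica {
        private volatile long leaseValidUntilMillis = 0;

        void onLeaseGranted(long validUntilMillis) { leaseValidUntilMillis = validUntilMillis; }

        boolean isPrimary() { return System.currentTimeMillis() < leaseValidUntilMillis; }

        void write(byte[] objectData) {
            if (!isPrimary()) {
                throw new IllegalStateException("not primary, client must retry at the lease holder");
            }
            // apply locally, then forward the update to the backup replicas (as on the metadata path)
        }
    }
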

SLIDE 18

2. File Data: Problem of Scale

– large number of storage servers
– large number of files
– primary per open file?
– primary per partition?
– long lease timeouts, e.g. 1 min?

SLIDE 19

2. File Data: How to coordinate many leases?

– Flease: decentralized lease coordination
  • no central lock service
  • coordinated among the storage servers holding a replica
– numbers:
  • Google's Chubby: ~640 ops/sec
  • ZooKeeper: ~7,000 ops/sec
  • Flease: ~5,000 ops/sec (3 nodes), ~50,000 ops/sec (30 nodes)

(a lease-acquisition sketch follows below)
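A hedged sketch of what decentralized lease coordination looks like from one storage server's point of view. This is not the Flease algorithm itself (Flease additionally handles concurrent candidates and clock drift with Paxos-style rounds); it only shows that the lease is negotiated among the servers holding the file's replicas, with no central lock service. The ReplicaPeer interface is an assumed RPC stub:

    import java.time.Instant;
    import java.util.List;

    final class LeaseCandidate {
        interface ReplicaPeer {
            // returns true if the peer grants the lease until 'until' (assumed RPC)
            boolean requestLease(String fileId, String candidateId, Instant until);
        }

        private final String myId;
        private final List<ReplicaPeer> peers;   // the other servers holding this file's replicas

        LeaseCandidate(String myId, List<ReplicaPeer> peers) {
            this.myId = myId;
            this.peers = peers;
        }

        boolean tryAcquire(String fileId, long leaseTimeoutMillis) {
            Instant until = Instant.now().plusMillis(leaseTimeoutMillis);
            int grants = 1;                                  // our own vote
            for (ReplicaPeer p : peers) {
                if (p.requestLease(fileId, myId, until)) grants++;
            }
            // primary for this file until 'until' only with a majority of grants
            return grants > (peers.size() + 1) / 2;
        }
    }
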

SLIDE 20

2. File Data: Max. number of open files/server

[Chart: max. open files per server vs. lease timeout, Flease vs. ZooKeeper, 30 nodes, LAN; at a 60 s timeout the chart reaches ~102,000 open files/server for Flease vs. ~14,700 for ZooKeeper]

  lease timeout | Flease  | ZooKeeper
  1 sec         |  1,700  |    245
  5 sec         |  8,500  |  1,223
  10 sec        | 17,010  |  2,445

(a capacity estimate follows below)
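These values are consistent with a simple capacity estimate (my note, not from the slides): each open file needs one lease renewal per lease period, so a server can sustain at most (lease operations per second per server) × (lease timeout) open files:

    max open files/server ≈ (aggregate lease ops/sec ÷ number of servers) × lease timeout
    Flease, 30 nodes:     (50,000 ÷ 30) × 1 s ≈ 1,700    and × 10 s ≈ 17,000
    ZooKeeper, 30 nodes:   (7,000 ÷ 30) × 1 s ≈ 230      and × 10 s ≈ 2,300  (table: 245 / 2,445)
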

SLIDE 21

Replication: Other File Systems

[Table (layout partly lost): metadata vs. file data replication in other systems. Recoverable entries: Lustre – Linux HA for metadata and file data (cf. the Linux HA slide); Ceph, HDFS, GlusterFS – RAID-1-style replicas, primary/backup + central cfg. service and monitoring, write-once; the exact cell assignment for these three is not recoverable from the extracted text.]

SLIDE 22

Replication: Lessons Learned

– event-based design → no message re-ordering
– separation of replication layers → simplified implementation and testing
– no free lunch: consistency across data centers is expensive

SLIDE 23

Thank You

– http://www.xtreemfs.org
– upcoming release 1.3 includes replication

XtreemFS is developed within the XtreemOS project. XtreemOS is funded by the European Commission under contract #FP6-033576.