XtreemFS: high-performance network file system clients and servers in userspace


  1. XtreemFS: high-performance network file system clients and servers in userspace
     Minor Gordon, NEC Deutschland GmbH, mgordon@hpce.nec.com

  2. Why userspace? (HIGH PERFORMANCE COMPUTING, 05/04/09)
     File systems are traditionally implemented in the kernel for performance and control. Some advantages of doing things in userspace:
     - High-level languages: Python, Ruby, et al. for prototyping, then C++ (→ tool support, reduced code footprint, etc.)
     - Protection: kernel-userspace bridges (Dokan, FUSE) are fairly stable, and the file system can crash without requiring a reboot
     - Porting: one common kernel→userspace upcall interface (FUSE) on Linux, OS X, Solaris
     - Acceptable performance for network file systems: often bound to disk anyway

  3. Overview
     - Implementing file systems in userspace
     - Handling concurrency
     - XtreemFS: an object-based distributed file system

  4. Implementing file systems in userspace
     - ~VFS functions: the FUSE kernel module translates operations into messages and writes them on an FD
     - The FUSE userspace library reads the messages, calls the appropriate callback function, and returns the result as a message
     - Callbacks must be thread-safe and must complete synchronously
     - Dokan (Win32) calls can be translated, sans sharing modes

     Callback signatures from FUSE and Dokan:

       static int mkdir(
         const char* path,
         mode_t mode
       );

       static int DOKAN_CALLBACK CreateDirectory(
         LPCWSTR FileName,
         PDOKAN_FILE_INFO
       );

  5. Abstract away
     - Yield C++ library for minimalist platform primitives, concurrency (next section), and IPC
     - Auto-generate client-server interfaces from IDL; make synchronous proxy calls that do message passing under the hood

       bool Volume::mkdir(
         const YIELD::Path& path,
         mode_t mode
       )
       {
         mrc_proxy.mkdir( Path( this->name, path ), mode );
         return true;
       }

  6. Handling concurrency
     Possible approaches:
     1) Let the (multiple) FUSE threads execute all of the logic of the system
        - Advantages: simple at the outset
        - Disadvantages: have to lock around shared data structures; error-prone, and the code becomes a mess
     2) Have some sort of event loop
        - Advantages: obviates the need for locks
        - Disadvantages: the code becomes even uglier, even faster; hard to parallelize

  7. Stages
     - Decompose file system logic into stages that pass messages via queues.
     - A stage is a unit of concurrency: two stages can always run concurrently on two different physical processors.
     - Single-threaded stages: shared data structures are encapsulated by a single serializing stage, so no locking is needed.
     - Most stages should be thread-safe (otherwise Amdahl's law comes into play).
     - A stage-aware scheduler can exploit the nature of stages as well as their communication pattern (the stage graph, similar to a process interaction graph).
     (Diagram: FUSE → Volume → MRC Proxy, connected by request queues.)

  8. XtreemFS
     - EU research project
     - Wide-area file system with RAID, replication
     - Aim for POSIX semantics; allow per-volume relaxation
     - Everything in userspace: test new ideas with minimal implementation cost
     - Goal: a usable file system that performs within an order of magnitude of kernel-based network file systems

  9. XtreemFS: Features
     - Staged design
     - Efficient key-value store for metadata
       - Based on Log-Structured Merge Trees
       - Simple implementation (~5k SLOC)
       - Snapshots
     - Striping
     - WAN operation
       - Distributed replicas held consistent
       - Automatic failover
     - Security with SSL, X.509

 10. XtreemFS: Stages
     - Client: FUSE
     - Servers:
       - Directory (DIR)
       - Metadata catalogue (MRC)
       - Object store (OSD)
     (Diagram: FUSE → XtreemFS Volume → DIR Proxy / MRC Proxy / File Cache → OSD Proxy, connected by request queues.)

 11. XtreemFS: Stages cont'd
     Advantages of staged design in XtreemFS:
     - No locking around shared data structures like caches
     - Other stages can be multithreaded to increase concurrency or offset blocking
     - Graceful degradation under [over]load with queue backpressure (the original raison d'être of stages in servers)
     - Userspace scheduling
       - Per-stage queue disciplines like SRPT
       - Stage selection (CPU scheduling)
       - Increased cache efficiency (cohort scheduling, my research)

 12. XtreemFS: local reads
     (Chart: local read throughput, y-axis 0 to 500,000, for read, reread, reverse read, stride read, random read, and pread, comparing NFS and ext4 (each with and without O_DIRECT) against two XtreemFS configurations.)

 13. XtreemFS: local writes
     (Chart: local write throughput, y-axis 0 to 250,000, for write, rewrite, random write, and pwrite, comparing the same systems.)

 14. Conclusion
     - Project runs until June 2010
     - Next release (beginning of May):
       - Re-implemented client (Linux, Windows, OS X)
       - Client-side metadata and data caching
       - New binary protocol (based on ONC-RPC)
       - Full SSL/X.509 support
       - Read-only WAN replication
       - Plugin policy modules for access control
     http://www.xtreemfs.org/

 15. Thank you for your attention. Questions?
