SLIDE 1

XtreemFS: high-performance network file system clients and servers in userspace

Minor Gordon, NEC Deutschland GmbH (mgordon@hpce.nec.com)

SLIDE 2

Why userspace?

File systems traditionally implemented in the kernel for performance and control

Some advantages of doing things in userspace:

❚ High-level languages: Python, Ruby, et al. for prototyping, then C++ (→ tool support, reduced code footprint, etc.)

❚ Protection: kernel-userspace bridges (Dokan, FUSE) are fairly stable; the file system can crash without requiring a reboot

❚ Porting: one common kernel-to-userspace upcall interface (FUSE) on Linux, OS X, Solaris

Acceptable performance for network file systems

❚ Often bound to disk anyway

SLIDE 3

Overview

Implementing file systems in userspace

Handling concurrency

XtreemFS: an object-based distributed file system

SLIDE 4

Implementing file systems in userspace

static int mkdir( const char* path, mode_t mode );

static int DOKAN_CALLBACK CreateDirectory( LPCWSTR FileName, PDOKAN_FILE_INFO );

~ VFS functions

FUSE kernel module translates operations to messages, writes them to an FD

FUSE userspace library reads the messages, calls the appropriate function, returns the result as a message

❚ Callbacks must be thread-safe and complete synchronously.

Dokan (Win32) calls can be translated, sans sharing modes.
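
As a concrete illustration of the userspace side, a minimal sketch using the standard high-level libfuse API (my_mkdir and its stub body are placeholders, not XtreemFS code):

    // Minimal libfuse sketch: register a mkdir callback and hand control
    // to FUSE. The callback runs synchronously and, with multithreaded
    // FUSE, must also be thread-safe.
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <cerrno>

    static int my_mkdir( const char* path, mode_t mode )
    {
      // Forward to the file system's own logic; return 0 or -errno.
      return -ENOSYS; // placeholder
    }

    int main( int argc, char** argv )
    {
      struct fuse_operations ops = {};
      ops.mkdir = my_mkdir;
      return fuse_main( argc, argv, &ops, NULL );
    }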

SLIDE 5

Abstract away

bool Volume::mkdir( const YIELD::Path& path, mode_t mode )
{
  mrc_proxy.mkdir( Path( this->name, path ), mode );
  return true;
}

Yield: C++ library for minimalist platform primitives, concurrency (next section), IPC

Auto-generate client-server interfaces from IDL; make synchronous proxy calls that do message passing under the hood.
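
A sketch of that pattern (all type and method names below are illustrative, not the actual Yield-generated interface): the proxy exposes an ordinary synchronous method, and the message passing is hidden inside it.

    // Illustrative proxy in the spirit of the generated interfaces:
    // callers make a plain method call; the proxy turns it into a
    // request message and blocks until the matching response arrives.
    #include <stdexcept>
    #include <string>
    #include <sys/types.h>

    struct MkdirRequest { std::string path; mode_t mode; }; // hypothetical

    struct MkdirResponse // hypothetical
    {
      int status;
      void check() const
      {
        if ( status != 0 ) throw std::runtime_error( "mkdir failed" );
      }
    };

    class MRCProxy // hypothetical stand-in for the generated proxy
    {
    public:
      void mkdir( const std::string& path, mode_t mode )
      {
        MkdirRequest req = { path, mode };
        MkdirResponse resp = send_and_wait( req ); // message passing under the hood
        resp.check();
      }

    private:
      MkdirResponse send_and_wait( const MkdirRequest& )
      {
        // Real code would serialize the request, write it to the MRC
        // connection, and park the caller until the reply arrives.
        MkdirResponse resp = { 0 };
        return resp;
      }
    };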

SLIDE 6

Handling concurrency

Possible approaches:

1) Let the (multiple) FUSE threads execute all of the logic of the system
 • Advantages: simple at the outset
 • Disadvantages: have to lock around shared data structures; error-prone, and the code becomes a mess (see the sketch after this list)

2) Have some sort of event loop
 • Advantages: obviates the need for locks
 • Disadvantages: the code becomes even uglier, even faster; hard to parallelize
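
The locking burden of approach 1 in miniature (names are illustrative): every FUSE worker thread that touches a shared structure must take a lock, and forgetting one anywhere is a data race.

    // Approach 1: multiple FUSE threads share a cache, so every
    // access must be wrapped in a lock.
    #include <map>
    #include <mutex>
    #include <string>

    static std::map<std::string, std::string> g_cache; // shared across FUSE threads
    static std::mutex g_cache_mutex;

    // Called concurrently from several FUSE threads.
    std::string lookup( const std::string& key )
    {
      std::lock_guard<std::mutex> guard( g_cache_mutex ); // omit this once -> data race
      auto i = g_cache.find( key );
      return i == g_cache.end() ? std::string() : i->second;
    }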

SLIDE 7

Stages

Decompose file system logic into stages that pass messages via queues.

A stage is a unit of concurrency: two stages can always run concurrently on two different physical processors.

❚ Single-threaded stages: shared data structures encapsulated by a single serializing stage – no locking

❚ Most stages should be thread-safe (otherwise Amdahl's law comes into play).

❚ A stage-aware scheduler can exploit the nature of stages as well as their communication pattern (the stage graph, similar to a process interaction graph).
[Stage diagram: the FUSE Volume stage passes Req messages through a queue to the MRC Proxy stage]
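
A minimal sketch of a single-threaded stage, assuming one std::thread worker per stage (illustrative, not the Yield implementation): the queue is the only synchronized object, so state owned by the stage is touched by exactly one thread and needs no locks.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>

    class Stage
    {
    public:
      Stage() : worker( &Stage::run, this ) {}
      ~Stage() { enqueue( std::function<void()>() ); worker.join(); }

      // Other stages communicate with this one only by enqueueing messages.
      void enqueue( std::function<void()> msg )
      {
        { std::lock_guard<std::mutex> g( m ); q.push( msg ); }
        cv.notify_one();
      }

    private:
      void run()
      {
        for ( ;; )
        {
          std::unique_lock<std::mutex> g( m );
          cv.wait( g, [this]{ return !q.empty(); } );
          std::function<void()> msg = q.front(); q.pop();
          g.unlock();
          if ( !msg ) return; // empty function = shutdown sentinel
          msg();              // stage-local state accessed by this thread only
        }
      }

      std::mutex m;
      std::condition_variable cv;
      std::queue< std::function<void()> > q;
      std::thread worker;
    };

Two such stages run their messages on two different threads, matching the definition above; a stage-aware scheduler would replace the dedicated thread per stage.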

SLIDE 8

XtreemFS

EU research project

Wide-area file system with RAID, replication

Aim for POSIX semantics, allow per-volume relaxation

Everything in userspace

❚ Test new ideas with minimal implementation cost

Goal: usable file system that performs within an order of magnitude of kernel-based network file systems

SLIDE 9

XtreemFS: Features

Staged design

Efficient key-value store for metadata

❚ Based on Log-Structured Merge Trees
❚ Simple implementation (~5k SLOC)
❚ Snapshots

Striping

WAN operation

❚ Distributed replicas held consistent
❚ Automatic failover
❚ Security with SSL, X.509
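
A toy sketch of the LSM idea behind the metadata store (the flush threshold and file naming are invented; this is not the actual XtreemFS store): writes go to a sorted in-memory table, which is flushed as an immutable sorted run on disk when it grows too large; merging runs and snapshots are omitted.

    #include <cstddef>
    #include <fstream>
    #include <map>
    #include <string>

    class MemTable
    {
    public:
      void put( const std::string& key, const std::string& value )
      {
        bytes += key.size() + value.size();
        table[key] = value; // std::map keeps keys sorted
        if ( bytes > FLUSH_THRESHOLD ) flush();
      }

    private:
      void flush()
      {
        // Write one immutable, sorted run; lookups consult newest runs first.
        std::ofstream run( "run-" + std::to_string( run_id++ ) + ".sst" );
        for ( const auto& kv : table )
          run << kv.first << '\t' << kv.second << '\n';
        table.clear();
        bytes = 0;
      }

      static const size_t FLUSH_THRESHOLD = 4 << 20; // illustrative 4 MB
      std::map<std::string, std::string> table;
      size_t bytes = 0;
      int run_id = 0;
    };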

SLIDE 10

XtreemFS: Stages

[Stage diagram, client side: FUSE feeds the XtreemFS Volume stage, which passes Req messages via queues to the DIR Proxy, MRC Proxy, File Cache, and OSD Proxy stages]

Servers:

❚ Directory (DIR)
❚ Metadata catalogue (MRC)
❚ Object store (OSD)

SLIDE 11

XtreemFS: Stages cont'd

Advantages of staged design in XtreemFS:

❚ No locking around shared data structures like caches
❚ Other stages can be multithreaded to increase concurrency or offset blocking
❚ Gracefully degrade under [over]load with queue backpressure (the original raison d'être of stages in servers)
❚ Userspace scheduling:
 • Per-stage queue disciplines like SRPT (see the sketch below)
 • Stage selection (CPU scheduling)
 • Increased cache efficiency (Cohort scheduling, my research)
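
For example, a per-stage SRPT discipline can be sketched as a priority queue ordered by estimated remaining processing time (the est_cost field and how it is estimated are assumptions, not XtreemFS code):

    #include <functional>
    #include <queue>
    #include <vector>

    struct StageMsg
    {
      std::function<void()> work;
      double est_cost; // estimated remaining processing time for this request
    };

    struct CostGreater
    {
      bool operator()( const StageMsg& a, const StageMsg& b ) const
      {
        return a.est_cost > b.est_cost; // cheapest request dequeued first
      }
    };

    // Drop-in replacement for a stage's FIFO queue.
    typedef std::priority_queue< StageMsg, std::vector<StageMsg>, CostGreater > SRPTQueue;

Bounding the queue's length and rejecting or delaying producers when it fills gives the backpressure mentioned above.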

SLIDE 12

XtreemFS: local reads

[Chart: local read throughput for read, reread, reverse read, stride read, random read, and pread workloads (axis 50,000 to 500,000); series: NFS with and without O_DIRECT, ext4 with and without O_DIRECT, XtreemFS, and the XtreemFS client release build]

SLIDE 13

XtreemFS: local writes

[Chart: local write throughput for write, rewrite, random write, and pwrite workloads (axis 50,000 to 250,000); same series as above]

SLIDE 14

Conclusion

Project runs until June 2010

Next release: beginning of May

❚ Re-implemented client (Linux, Win, OS X)
❚ Client-side metadata, data caching
❚ New binary protocol (based on ONC-RPC)
❚ Full SSL/X.509 support
❚ Read-only WAN replication
❚ Plugin policy modules for access control

http://www.xtreemfs.org/

SLIDE 15

Thank you for your attention. Questions?