

  1. What's Beyond IndexFS & BatchFS: Envisioning a Parallel File System without Dedicated Metadata Servers. Qing Zheng, Kai Ren, Garth Gibson, Bradley W. Settlemyer, Gary Grider. Carnegie Mellon University and Los Alamos National Laboratory. PDSW 2015. Parallel Data Lab - http://www.pdl.cmu.edu/

  2. Scaling needs decoupling
  • NASD [asplos98]: decoupling data from metadata (Lustre, Google FS, etc.)
  • IndexFS [sc14]: dynamically partitioned metadata middleware, orders of magnitude faster than Lustre in metadata
  [Figure: throughput (Kop/s, log scale) of IndexFS-on-Lustre (32 clients running IndexFS) vs. Lustre (single server, 32 clients); IndexFS is 30x to 300x faster at empty file creation, file lookup, and file deletion.]
  Exa-scaling demands ever more decoupling.

  3. Compute-side server code
  • BatchFS [pdsw14]: decoupling clients from servers
  o temporarily scale beyond the total number of servers
  o very fast for a while; eventually clients communicate with servers to merge updates
  [Figure: file creates (Kop/s) with 16 servers and 64 clients; BatchFS reaches 19,692 Kop/s vs. IndexFS's 618 Kop/s, roughly a 30x speedup.]
  How much further can we delay and decouple merging?

  4. ∆FS Goal
  • Want the peak throughput BatchFS demonstrated
  • Compel freedom from server synchronization
  o by eliminating all server machines
  o by dealing with issues arising from the absence of metadata servers
  o by not assuming an underlying PFS
  Scale beyond BatchFS.

  5. Agenda
  • DeltaFS design
  • Why having no dedicated servers is not a problem

  6. Middleware Design
  ∆FS is middleware spawned by each parallel app.
  [Figure: apps (processes P1, P2, P3, ..., Pn) each link the ∆FS middleware and run directly on an object store storing data/metadata.]

  7. ∆FS Overview
  The FS is defined by a set of snapshots, each stored as a set of metadata logs and data objects.
  [Figure: a logical view of a namespace (/ with children b, c, e) mapped onto log objects in object storage; each log is a list of metadata ops, e.g. "rename /d -> /e" and "rmdir /c". Data objects not shown.]
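To make "snapshot = a set of metadata-op logs" concrete, here is a minimal sketch in Python. The names (MetadataOp, apply_log) are hypothetical, not DeltaFS's real format, and the replay loop is only to illustrate the semantics; as slide 11 explains, the real logs are self-indexed LSM diffs that never need replaying.

```python
from dataclasses import dataclass

@dataclass
class MetadataOp:
    kind: str             # e.g. "mkdir", "create", "rename", "rmdir"
    path: str
    new_path: str = None  # only used by "rename"

def apply_log(namespace, log):
    """Fold one metadata log into a logical namespace view (illustration only)."""
    for op in log:
        if op.kind in ("mkdir", "create"):
            namespace.add(op.path)
        elif op.kind == "rmdir":
            namespace.discard(op.path)
        elif op.kind == "rename":
            namespace.discard(op.path)
            namespace.add(op.new_path)
    return namespace

# The example from the slide: an earlier log created /b, /c, /d;
# a later diff renames /d to /e and removes /c.
base  = [MetadataOp("mkdir", p) for p in ("/b", "/c", "/d")]
delta = [MetadataOp("rename", "/d", new_path="/e"), MetadataOp("rmdir", "/c")]

print(sorted(apply_log(apply_log(set(), base), delta)))  # ['/b', '/e']
```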

  8. System Model
  An app reads its input dataset from an existing FS snapshot and creates a new snapshot with its output data inserted.
  [Figure: an input snapshot produced by a previous app feeds the app; the app creates a new snapshot, ready to be used by future apps; both views map to log objects in object storage.]
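One way to picture this model: a snapshot is an immutable set of log objects, and producing output means publishing one more log. Snapshot and run_app below are invented names for illustration; the talk does not define a client API.

```python
class Snapshot:
    """A snapshot = an immutable set of metadata-log objects (sketch)."""
    def __init__(self, logs):
        self.logs = tuple(logs)  # log objects already sealed in the object store

def run_app(input_snap):
    delta = []                               # metadata ops produced by this run
    delta.append(("create", "/out/result"))  # e.g. the app writes one file
    # The output snapshot shares the input logs by reference and adds one
    # new delta log; nothing is copied or rewritten.
    return Snapshot(input_snap.logs + (tuple(delta),))

prev = Snapshot([(("mkdir", "/out"),)])  # snapshot left by a previous app
out  = run_app(prev)
print(len(out.logs))                     # 2: the inherited log + the new delta
```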

  9. Key take-aways
  • NO global namespace: each namespace is defined by the app and the logs loaded by it
  • NO false sharing: apps don't access logs they don't need
  • NO dedicated metadata servers: an app communicates directly with the storage to load/dump metadata logs

  10. How are logs implemented?
  • TableFS [atc13]: namespace = a large dir-entry table + embedded inodes
  • Implemented as an LSM-tree (a collection of ordered B-trees)
  • Each log object is a differential B-tree (a diff), representing a set of recent updates (e.g., newly inserted/modified inodes)
  [Figure: log objects as sorted runs of key/value pairs.]
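A minimal sketch of the table encoding described above, assuming a TableFS-style key of (parent inode id, entry name); the byte layout and names here are invented for illustration, not TableFS's or DeltaFS's actual format.

```python
import struct

def encode_key(parent_ino, name):
    # Big-endian parent id first, so all entries of one directory sort together.
    return struct.pack(">Q", parent_ino) + b"/" + name.encode()

def make_diff(updates):
    """A 'differential B-tree': one sorted, immutable run of recent updates."""
    return sorted((encode_key(p, n), inode) for (p, n), inode in updates.items())

# Fabricated example: three directory entries with embedded inode blobs.
diff = make_diff({
    (1, "climate"): b"dir ino=2",
    (1, "ocean"):   b"dir ino=3",
    (3, "pacific"): b"dir ino=4",
})
for key, inode in diff:
    print(key, inode)
```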

  11. Why is an LSM-tree a good idea?
  • Logs are first-class data: no need to replay logs to recover namespaces; near-zero cost of merging namespaces
  • Each log is self-indexed: scanning/reading within a single log is fast, O(log N), and scanning/reading a series of non-overlapping logs is as fast as a single log
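The near-zero merge cost follows because each log is already a sorted run: a combined scan is a streaming k-way merge, never a replay or rewrite. A sketch using Python's standard heapq.merge (the log contents are fabricated):

```python
import heapq

# Each log is already a sorted run, so a combined scan is a streaming k-way
# merge: O(total entries), with an O(log k) heap step per entry for k logs.
log_a = [(b"/a/x", b"ino 10"), (b"/a/y", b"ino 11")]  # one app's diff
log_b = [(b"/b/u", b"ino 20"), (b"/b/v", b"ino 21")]  # another, non-overlapping

for key, inode in heapq.merge(log_a, log_b):
    print(key.decode(), inode.decode())
```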

  12. Agenda
  • DeltaFS design
  • Why having no dedicated servers is not a problem

  13. P1: Do my apps need the FS to communicate/synchronize?

  14. Unrelated Apps
  Work on different datasets and don't communicate.
  [Figure: App 1 and App 2 work under disjoint subtrees of /, e.g. climate and ocean (with pacific and atlantic beneath).]
  Don't need the FS to communicate.

  15. Self-Coordinating Apps
  Use middleware to share faster and more efficiently.
  [Figure: a parallel scientific app whose processes P1, P2, P3 coordinate over MPI while sharing a file.]
  Don't need the FS to communicate.

  16. Workflow Apps
  Externally coordinated by job schedulers.
  [Figure: a job scheduler / workflow engine sequences stages (e.g., Mapper -> Reducer, Iter3 -> Iter4) over files such as login_log, user_profile, and movie_profile.]
  Don't need the FS to communicate.

  17-18. Anonymous Synchronization
  e.g., two app instances competing for mastership.
  [Figure: App 1 and App 2 racing to create a .LOCK file on Lustre vs. coordinating through ZooKeeper (ZAB), Paxos, or Raft.]
  Turn to a mechanism outside the FS to coordinate.

  19. P2: But I often use different programs to access data concurrently!

  20. User-requested concurrent sharing
  [Figure: a monitoring app and a viz app each run their own ∆FS instance and attach to the ∆FS of the primary parallel app (processes P1, P2, P3, ..., Pn).]
  Link to the ∆FS middleware and attach to the primary parallel app.

  21. P3: Which snapshots to use?

  22. Which snapshots to use?
  Option 1: rely on job schedulers to automate namespace propagation.
  [Figure: the job scheduler / workflow engine passes each app an input= snapshot and records its output= snapshot, chaining App_1's output into App_2's input.]
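For example, a scheduler could thread snapshot ids through a workflow description, so apps never resolve snapshot names themselves. This is a hypothetical spec, not an existing scheduler format:

```python
# All names invented: each step's output snapshot becomes the next
# step's input, so namespace propagation is fully automated.
workflow = [
    {"app": "App_1", "input": "snap-000", "output": "snap-001"},
    {"app": "App_2", "input": "snap-001", "output": "snap-002"},
]

for step in workflow:
    print(f"run {step['app']}: mount {step['input']}, publish {step['output']}")
```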

  23. Which snapshots to use?
  Option 2: ask external registries using search predicates.
  [Figure: (1) an app publishes its snapshot to a snapshot registry; (2) a later app searches the registry; (3) it collects the matching snapshots, e.g. those of App_1 and App_2.]

  24. Finding snapshots is like searching a page using Google
  • Possible search predicates:
  o find the latest stable science code for my science
  o find the latest recommended mesh model and cleaned input data
  o find the latest vendor-recommended HW libraries
  • Also, there can be multiple snapshot registries
  This allows programmable namespace composition, as sketched below.
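A minimal in-memory stand-in for the publish/search/collect cycle of slides 23-24; a real registry would be an external service, and the tag scheme here is invented:

```python
registry = []  # (tags, snapshot_id) records

def publish(tags, snapshot_id):
    registry.append((set(tags), snapshot_id))

def search(predicate):
    return [sid for tags, sid in registry if predicate(tags)]

publish({"science-code", "stable"}, "snap-042")       # step 1: apps publish
publish({"mesh-model", "recommended"}, "snap-043")

# Step 2: "find the latest stable science code" as a predicate.
hits = search(lambda tags: {"science-code", "stable"} <= tags)
print(hits)  # step 3: collect these snapshots and compose a namespace
```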

  25. P4: What about potential conflicts among different snapshots?

  26. Unrelated Apps
  Work on different portions of the namespace.
  [Figure: App 1 and App 2 under disjoint subtrees of /, e.g. climate and ocean (pacific, atlantic).]
  Won't generate any conflicts.

  27. Workflow Apps
  Access the same dataset at different times.
  [Figure: scheduler-sequenced stages (Mapper -> Reducer, Iter3 -> Iter4) over login_log, user_profile, and movie_profile.]
  Won't generate any conflicts.

  28. Self-Coordinating Apps
  Coded to be conflict-free.
  [Figure: a parallel scientific app whose processes P1, P2, P3 coordinate over MPI while sharing a file.]
  Won't generate any conflicts.

  29. Namespace composition is fast if there is no conflict
  • Recall: near-zero cost of merging logs
  o better still if the logs do not overlap with each other
  What if there are conflicts?

  30. Use domain knowledge
  Conflicts are resolved per the app's own reconciliation policy.
  [Figure: two input snapshots each contain /deltafs/file_1 and /deltafs/file_2; one possible resolution outcome keeps both versions under suffixed names: file_1(a), file_1(b), file_2(a), file_2(b).]
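The "keep both versions" outcome in the figure could be expressed as an app-supplied policy function plugged into a generic merge step. Everything below (keep_both, merge, the dict-based snapshots) is invented for illustration:

```python
def keep_both(name, versions):
    """Resolve a clash by renaming each version, e.g. file_1(a), file_1(b)."""
    return {f"{name}({tag})": v for tag, v in versions}

def merge(snap_a, snap_b, policy):
    out = {}
    for name in snap_a.keys() | snap_b.keys():
        if name in snap_a and name in snap_b and snap_a[name] != snap_b[name]:
            # Conflict: defer to the app's own reconciliation policy.
            out.update(policy(name, [("a", snap_a[name]), ("b", snap_b[name])]))
        else:
            out[name] = snap_a.get(name, snap_b.get(name))
    return out

a = {"/deltafs/file_1": "v1",  "/deltafs/file_2": "v2"}
b = {"/deltafs/file_1": "v1'", "/deltafs/file_2": "v2'"}
print(sorted(merge(a, b, keep_both)))
```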

  31. Use curators to remember conflict-resolution results
  So resolutions are not duplicated by different apps.
  [Figure: a curator inherits a pre-resolved namespace from an app; other apps take namespaces directly from curators, including composing one namespace from two curators.]

  32. Conclusion
  • Strong scalability needs strong decoupling
  o existing clients synchronize too often with servers
  o removing servers forces us to rethink what is necessary
  o we need to try a radically different model for shared storage
