SLIDE 1

What’s Beyond IndexFS & BatchFS

Envisioning a Parallel File System without Dedicated Metadata Servers

Qing Zheng

Kai Ren, Garth Gibson, Bradley W. Settlemyer, Gary Grider

Carnegie Mellon University and Los Alamos National Laboratory

SLIDE 2

Scaling needs decoupling

  • NASD [asplos98]
  • decoupling data from metadata
  • Lustre, Google FS, etc.
  • IndexFS [sc14]
  • dynamically partitioned metadata middleware
  • orders of magnitude faster than Lustre in metadata
[Chart: throughput (Kop/s) for empty file creation, file lookup, and file deletion; IndexFS_Lustre (32 clients run IndexFS) vs. Lustre (single server, 32 clients); IndexFS is roughly 30x to 300x faster.]

Exa-scaling demands ever more decoupling

SLIDE 3

Compute-side server code

  • BatchFS [pdsw14]
  • decoupling clients from servers
  • temporarily scale beyond the total number of servers
  • very fast for a while; eventually clients communicate with servers to merge updates

[Chart: file creates (Kop/s) with 16 servers and 64 clients; IndexFS achieves 618 Kop/s vs. 19,692 Kop/s for BatchFS, roughly a 30x speedup.]

How much further can we delay & decouple merging?

SLIDE 4

∆FS Goal

  • Want the peak throughput BatchFS demonstrated
  • Complete freedom from server synchronization
  • by eliminating all server machines
  • by dealing with issues arising from the absence of metadata servers
  • by not assuming an underlying PFS


Scale beyond BatchFS

SLIDE 5

Agenda

  • DeltaFS design
  • Why having no dedicated servers is not a problem

SLIDE 6

Middleware Design

∆FS is middleware spawned by each parallel app

[Diagram: several parallel apps; each app's processes (P1, P2, P3, ..., Pn) link the ∆FS library and talk directly to an object store storing data/metadata.]

SLIDE 7

∆FS Overview

An FS is defined by a set of snapshots, each stored as a set of metadata logs and data objects


[Diagram: object storage holds log objects; the logical view of an FS snapshot is a list of metadata ops, e.g. rename /d -> /e, rmdir /c. Note: data objects not shown here.]
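To make the log-structured snapshot idea concrete, here is a minimal sketch (hypothetical, not the actual ∆FS code) of replaying a snapshot's list of metadata ops to obtain a namespace view like the one pictured:

```python
# Minimal sketch (hypothetical, not the actual DeltaFS code): a snapshot is
# modeled as an ordered list of metadata ops; replaying them yields a
# namespace view.
from typing import Dict, List, Tuple

def replay(ops: List[Tuple[str, ...]]) -> Dict[str, dict]:
    """Apply metadata ops in order and return a {path: attributes} view."""
    ns: Dict[str, dict] = {"/": {"type": "dir"}}
    for op in ops:
        if op[0] == "mkdir":
            ns[op[1]] = {"type": "dir"}
        elif op[0] == "create":
            ns[op[1]] = {"type": "file"}
        elif op[0] == "rename":
            ns[op[2]] = ns.pop(op[1])
        elif op[0] in ("rmdir", "unlink"):
            ns.pop(op[1], None)
    return ns

# The ops shown in the slide's example:
snapshot_ops = [("mkdir", "/b"), ("mkdir", "/c"), ("mkdir", "/d"),
                ("rename", "/d", "/e"), ("rmdir", "/c")]
print(sorted(replay(snapshot_ops)))   # ['/', '/b', '/e']
```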

SLIDE 8

System Model

An app reads its input dataset from an existing FS snapshot and creates a new snapshot with its output data inserted.

[Diagram: in object storage, the app reads the log objects of an input snapshot produced by a previous app and creates new log objects forming a new snapshot ready to be used by future apps.]
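The same model from an app's point of view, as a minimal sketch (all names are hypothetical, and a plain dict stands in for the object store):

```python
# Hypothetical sketch of the lifecycle above; none of these names are the
# real DeltaFS API, and a plain dict stands in for the object store.
class DeltaFSClient:
    def __init__(self, object_store, input_snapshots):
        self.store = object_store
        self.inputs = list(input_snapshots)   # log objects loaded, not copied
        self.ops = []                         # this app's new metadata ops

    def create(self, path):
        self.ops.append(("create", path))     # buffered locally, no server RPC

    def publish(self, name):
        # Dump the buffered ops as one new log object; the output snapshot
        # is simply the input logs plus the new one.
        self.store[name] = tuple(self.ops)
        return self.inputs + [name]

store = {"snap0": (("mkdir", "/sim"),)}       # produced by a previous app
fs = DeltaFSClient(store, ["snap0"])
fs.create("/sim/out.0")
print(fs.publish("snap1"))                    # ['snap0', 'snap1']
```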

SLIDE 9

Key take-away

  • NO global namespace

Each namespace is defined by the app and the logs loaded by it

  • NO false sharing

Apps don’t access logs they don’t need

  • NO dedicated metadata servers

Apps communicate directly with storage to load and dump metadata logs

SLIDE 10

How are logs implemented?

  • TableFS [atc13]
  • namespace = a large dir entry table + embedded inodes
  • implemented as an LSM-Tree (a collection of ordered B-Trees)
  • Each log object is a differential B-Tree (diff)
  • representing a set of recent updates (e.g. newly inserted/modified inodes)


[Diagram: a log object as a sorted run of key/value pairs.]
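As an illustration of the dir-entry table with embedded inodes (a sketch with a made-up key encoding, not TableFS's actual on-disk format):

```python
# Sketch of a TableFS-style metadata table (made-up key encoding, not the
# exact on-disk format): each dir entry is keyed by (parent dir id, name)
# and its value embeds the inode attributes. One such sorted run is what a
# log object ("diff") holds.
import json

ROOT = 0
table = {}   # stands in for one sorted run of key/value pairs

def put(parent_id, name, inode_id, ftype, size=0):
    key = f"{parent_id:016x}/{name}"          # sorts by parent dir, then name
    table[key] = json.dumps({"ino": inode_id, "type": ftype, "size": size})

put(ROOT, "climate", 1, "dir")
put(1, "pacific", 2, "dir")
put(1, "atlantic", 3, "dir")

# Listing a directory is a prefix scan over the sorted keys.
prefix = f"{1:016x}/"
print([k.split("/", 1)[1] for k in sorted(table) if k.startswith(prefix)])
# ['atlantic', 'pacific']
```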

SLIDE 11

Why is an LSM-Tree a good idea?

  • Logs are 1st-class data
    No need to replay logs to recover namespaces
    Near-zero cost of merging namespaces
  • Each log is self-indexed
    Scanning/reading within a single log is fast: O(log N)
    Scanning/reading a series of non-overlapping logs is as fast as a single log
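A rough illustration of why merging is cheap (a sketch, not the ∆FS implementation): the logs are already sorted, so composing them is a streaming k-way merge in which newer logs shadow older ones:

```python
# Sketch (not the DeltaFS implementation): composing sorted metadata logs is
# a streaming k-way merge; newer logs shadow older ones on key collisions.
import heapq

def merge_logs(*logs):
    """Each log maps key -> value; later arguments are newer and win on ties."""
    runs = [sorted((key, i, log[key]) for key in log) for i, log in enumerate(logs)]
    merged = {}
    for key, idx, val in heapq.merge(*runs):   # streams the runs in key order
        if key not in merged or idx > merged[key][0]:
            merged[key] = (idx, val)
    return {k: v for k, (_, v) in merged.items()}

old = {"/a": "inode-1", "/d": "inode-4"}
new = {"/d": "inode-4-updated", "/e": "inode-5"}   # newer log shadows /d
print(merge_logs(old, new))
# {'/a': 'inode-1', '/d': 'inode-4-updated', '/e': 'inode-5'}
```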

SLIDE 12

Agenda

  • DeltaFS design
  • Why having no dedicated servers is not a problem

SLIDE 13

P1: Do my apps need the FS to communicate/synchronize?

SLIDE 14

Unrelated Apps

Work on different datasets and don’t communicate.

[Diagram: a namespace tree with directories such as climate, ocean, pacific, and atlantic under /; App1 and App2 each work on a different subtree.]

Don’t need the FS to communicate

SLIDE 15

Self-Coordinating Apps

Use middleware to share faster & more efficiently

Don’t need the FS to communicate

[Diagram: a parallel scientific app whose processes P1, P2, and P3 coordinate over MPI while writing to a shared file.]

SLIDE 16

Workflow Apps

Externally coordinated by job schedulers

[Diagram: a workflow (mapper/reducer stages, iterations 3 and 4) driven by a job scheduler or workflow engine over datasets such as user_profile, movie_profile, and login_log.]

Don’t need the FS to communicate

SLIDE 17

Anonymous Synchronization

e.g. Two app instances competing for mastership

Turn to a mechanism outside the FS to coordinate

[Diagram: instead of App1 and App2 racing to create a .LOCK file in Lustre, they coordinate through an external service such as ZooKeeper (ZAB), Paxos, or Raft.]
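For example, mastership election through ZooKeeper might look roughly like this (a sketch using the kazoo client; the hosts, lock path, and run_as_master are illustrative placeholders):

```python
# Sketch: anonymous synchronization outside the FS via ZooKeeper, using the
# kazoo client library. The hosts, lock path, and run_as_master() are
# illustrative placeholders, not part of any real deployment.
from kazoo.client import KazooClient

def run_as_master():
    print("this instance won mastership")     # stand-in for master-only work

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
lock = zk.Lock("/demo/mastership", "app-instance-1")
with lock:                                    # blocks until the lock is held
    run_as_master()
zk.stop()
```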

SLIDE 19

P2: But I often use different programs to access data concurrently!

SLIDE 20

User-requested concurrent sharing


[Diagram: a monitoring tool (Mon) and a visualization tool (Viz) each link their own ∆FS instance and attach to the ∆FS of the primary parallel app (processes P1..Pn).]

Link to ∆FS middleware and attach to the primary parallel app

SLIDE 21

P3: Which snapshots to use?

SLIDE 22

Option 1: rely on job schedulers to automate namespace propagation

Which snapshots to use?

[Diagram: a job scheduler / workflow engine launches App_1 and App_2, passing each job an input=… and output=… snapshot specification.]
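A sketch of what such a scheduler-driven specification might look like (the field names are invented; the "…" placeholders are left as in the slide):

```python
# Hypothetical job descriptions a workflow engine could hand to each app;
# the field names are invented and "…" stands for real snapshot names
# chosen by the scheduler.
app_1_job = {
    "exe": "simulation.exe",
    "input_snapshots": ["…"],       # snapshots produced by earlier jobs
    "output_snapshot": "…",         # snapshot this job will publish
}
app_2_job = {
    "exe": "analysis.exe",
    "input_snapshots": [app_1_job["output_snapshot"]],   # propagated automatically
    "output_snapshot": "…",
}
```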

SLIDE 23

Which snapshots to use?

Option 2: ask external registries using search predicates

[Diagram: App_1 publishes its output snapshot to a snapshot registry, which collects snapshot descriptions; App_2 later searches the registry to decide which snapshots to use.]

SLIDE 24

Finding snapshots is like searching for a page using Google

  • Possible search predicates
  • find latest stable science code for my science
  • find latest recommended mesh model and cleaned input data
  • find latest vendor recommended HW libraries
  • Also, there can be multiple snapshot registries


Allows programmable namespace composition
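A sketch of how such predicate-based lookups might compose a namespace (the registry, its schema, and the predicate format are all hypothetical, not an existing service):

```python
# Hypothetical snapshot-registry lookup; the registry, its schema, and the
# predicate format are all invented for illustration.
registry = [
    {"name": "science-code-v12", "kind": "code", "status": "stable",      "ts": 120},
    {"name": "science-code-v13", "kind": "code", "status": "testing",     "ts": 130},
    {"name": "mesh-2015-10",     "kind": "mesh", "status": "recommended", "ts": 110},
]

def find_latest(**predicates):
    """Return the newest registered snapshot matching every given predicate."""
    hits = [s for s in registry if all(s.get(k) == v for k, v in predicates.items())]
    return max(hits, key=lambda s: s["ts"])["name"] if hits else None

# Compose a namespace from several independent searches.
inputs = [find_latest(kind="code", status="stable"),
          find_latest(kind="mesh", status="recommended")]
print(inputs)   # ['science-code-v12', 'mesh-2015-10']
```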

SLIDE 25

P4: What about potential conflicts among different snapshots?

SLIDE 26

Unrelated Apps

Work on different portions of the namespace

[Diagram: the same namespace tree as in slide 14; App1 and App2 work on disjoint subtrees.]

Won’t generate any conflicts

SLIDE 27

Workflow Apps

Access the same dataset at different times

[Diagram: the same workflow as in slide 16; mapper/reducer stages and iterations access user_profile, movie_profile, and login_log at different times.]

Won’t generate any conflicts

SLIDE 28

Self-Coordinating Apps

Coded to be conflict-free

Won’t generate any conflicts

[Diagram: the same parallel scientific app as in slide 15; processes coordinate over MPI.]

SLIDE 29

Namespace composition is fast if there is no conflict

  • Recall: near-zero cost of merging logs
  • better if those logs do not overlap with each other


What if there are conflicts?

SLIDE 30

Use domain knowledge

Conflicts are resolved according to each app’s own reconciliation policy

[Diagram: two input snapshots each contain /deltafs/file_1 and /deltafs/file_2; one possible resolution outcome keeps both versions as file_1(a)/file_1(b) and file_2(a)/file_2(b) under /deltafs.]
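A sketch of a rename-both policy like the one pictured (the function and its behavior are invented for illustration; real policies would be app-specific):

```python
# Sketch of one possible reconciliation policy: when two snapshots both
# define the same path differently, keep both versions under suffixed names,
# as in the figure above. Real policies would be app-specific.
def reconcile(snap_a, snap_b):
    merged = {}
    for path in sorted(set(snap_a) | set(snap_b)):
        if path in snap_a and path in snap_b and snap_a[path] != snap_b[path]:
            merged[path + "(a)"] = snap_a[path]   # conflict: keep both versions
            merged[path + "(b)"] = snap_b[path]
        else:
            merged[path] = snap_a.get(path, snap_b.get(path))
    return merged

a = {"/deltafs/file_1": "v1-from-a", "/deltafs/file_2": "v2"}
b = {"/deltafs/file_1": "v1-from-b", "/deltafs/file_2": "v2"}
print(reconcile(a, b))
# {'/deltafs/file_1(a)': 'v1-from-a', '/deltafs/file_1(b)': 'v1-from-b',
#  '/deltafs/file_2': 'v2'}
```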

SLIDE 31

Use curators to remember conflict resolution results

So no duplicated resolutions by different apps


[Diagram: a namespace curator inherits a pre-resolved namespace from an app; another curator inherits from the first; an app can also take namespaces directly from two curators.]

SLIDE 32

Conclusion

  • Strong scalability needs strong decoupling
  • existing clients sync too often with servers
  • removing servers forces us to rethink what is necessary
  • need to try a radically different model for shared storage
