What’s Beyond IndexFS & BatchFS
Envisioning a Parallel File System without Dedicated Metadata Servers
Qing Zheng
Kai Ren, Garth Gibson, Bradley W. Settlemyer, Gary Grider
Carnegie Mellon University, Los Alamos National Laboratory
metadata
metadata middleware
than Lustre in metadata
PDSW 2015 Parallel Data Lab - http://www.pdl.cmu.edu/ Page 2
[Figure: throughput (Kop/s, axis ticks 5 to 5,000) for empty file creation, file lookup, and file deletion; IndexFS_Lustre (32 clients run IndexFS) vs. Lustre (single server, 32 clients); annotated speedups: 100x, 30x, 300x]
Exa-scaling demands ever more decoupling
servers
the total number of servers
eventually clients communicate with servers to merge updates
[Figure: file creates (Kop/s) with 16 servers and 64 clients; IndexFS 618, BatchFS 19,692 (30X)]
How much further can we delay and decouple merging?
Scale beyond BatchFS
[Diagram: Apps; one App shown as processes P1, P2, P3, …, Pn with ∆FS; object store storing data/metadata]
FS defined by a set of snapshots stored as sets of metadata logs and data objects
[Diagram: Object Storage holding four Logs; Logical View showing namespace trees over /, b, c, d, e, with the operations rename /d->/e and rmdir /c]
FS snapshot: a list of metadata ops
Note: data objects not shown here
Reads input dataset from an existing FS snapshot
Creates a new snapshot with output data inserted
[Diagram: input: an input snapshot produced by a previous app; create: a new snapshot ready to be used by future apps]
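The read-input, create-output flow on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `Snapshot` class, the `run_app` function, and the log names are hypothetical and do not come from the slides. The sketch models a snapshot as an immutable list of metadata log names, so producing output never alters the input snapshot.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical model: a snapshot is an immutable list of metadata log names."""
    logs: Tuple[str, ...]


def run_app(input_snapshot: Snapshot, new_log: str) -> Snapshot:
    """Return a new snapshot: the input's logs plus the log this app wrote.

    The input snapshot is left untouched, so it stays usable by other apps.
    """
    return Snapshot(logs=input_snapshot.logs + (new_log,))


snap0 = Snapshot(logs=("log-000",))   # produced by a previous app
snap1 = run_app(snap0, "log-001")     # this app's output snapshot
snap2 = run_app(snap1, "log-002")     # a future app builds on snap1
```

Because each snapshot only appends to a list of logs, older snapshots such as `snap0` remain valid inputs for other apps.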
[Diagram: Object Storage holding Logs; Logical View; App]
Each namespace is defined by the app and the logs loaded by it
Apps don’t access logs not needed by them
App directly communicates with the storage to load/dump metadata logs
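As a rough sketch of the three points above, the following models an app that loads only the logs it names and dumps its own updates as a new log. Everything here is an assumption made for illustration: the in-memory `store` dictionary stands in for the object store, and `AppNamespace`, its methods, and the log names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a shared object store: object name -> metadata log.
# Each log maps a path to a small metadata record.
ObjectStore = Dict[str, Dict[str, dict]]


class AppNamespace:
    """A namespace defined by the app and the logs it chose to load."""

    def __init__(self, store: ObjectStore, log_names: List[str]) -> None:
        self.store = store
        # Load only the named logs; logs not named are never touched.
        self.loaded = [store[name] for name in log_names]
        self.pending: Dict[str, dict] = {}   # this app's own new entries

    def create(self, path: str, meta: dict) -> None:
        self.pending[path] = meta

    def lookup(self, path: str) -> Optional[dict]:
        if path in self.pending:
            return self.pending[path]
        for log in reversed(self.loaded):    # later logs take precedence
            if path in log:
                return log[path]
        return None

    def dump(self, name: str) -> None:
        """Write this app's updates to the store as one new log."""
        self.store[name] = dict(self.pending)
        self.loaded.append(self.store[name])
        self.pending = {}


store: ObjectStore = {
    "climate.log": {"/climate/pacific": {"size": 10}},
    "movies.log": {"/movies/a": {"size": 5}},
}
ns = AppNamespace(store, ["climate.log"])
ns.create("/climate/atlantic", {"size": 7})
ns.dump("climate-out.log")
```

In this sketch `ns.lookup("/movies/a")` returns `None`, because the app never loaded `movies.log`.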
LSM-Tree (a collection of ordered B-Trees)
[Diagram: a Log laid out as a sequence of k/v pairs]
No need to replay logs to recover namespaces
Near-zero cost of merging namespaces
Scanning/reading within a single log is fast: O(log N)
Scanning/reading a series of non-overlapping logs is as fast as a single log
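The lookup-cost claims above can be illustrated with a small sketch. It is a simplification, not the authors' implementation: each log is a plain sorted list of (key, value) pairs rather than a B-Tree or an on-disk table, and the names `get` and `LogSeries` are hypothetical. A lookup in one log is a binary search. For logs with non-overlapping key ranges, a second binary search over each log's first key picks the single log that can hold the key.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

SortedLog = List[Tuple[str, str]]   # (key, value) pairs sorted by key


def get(log: SortedLog, key: str) -> Optional[str]:
    """Binary-search one sorted log: O(log N) comparisons."""
    i = bisect_left(log, (key,))    # (key,) sorts just before (key, value)
    if i < len(log) and log[i][0] == key:
        return log[i][1]
    return None


class LogSeries:
    """A series of non-empty logs whose key ranges do not overlap."""

    def __init__(self, logs: List[SortedLog]) -> None:
        self.logs = sorted(logs, key=lambda log: log[0][0])
        self.first_keys = [log[0][0] for log in self.logs]

    def get(self, key: str) -> Optional[str]:
        # Find the only log whose range could contain the key.
        i = bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None
        return get(self.logs[i], key)


log_a: SortedLog = [("/a", "1"), ("/b", "2")]
log_b: SortedLog = [("/m", "3"), ("/z", "4")]
series = LogSeries([log_b, log_a])
```

With this layout, combining namespaces amounts to adding a log to the series; no entries are rewritten or replayed.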
[Diagram: namespace tree with /, climate, pacific, and atlantic; App1 and App2]
Don’t need the FS to communicate
Don’t need the FS to communicate
[Diagram: Parallel Scientific App; processes P1, P2, P3 connected by MPI; File]
[Diagram: namespace / with user_profile, movie_profile, and login_log; Mapper and Reducer; Iter3 and Iter4]
Don’t need the FS to communicate
[Diagram labels: job scheduler, workflow engine]
Turn to a mechanism outside the FS to coordinate
[Diagram: App1 and App2 sharing a .LOCK on Lustre; App1 and App2 sharing a .LOCK via Zookeeper (ZAB), Paxos, Raft]
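One common way to realize a `.LOCK` file is exclusive file creation; the sketch below shows that idea. It is an assumption about how such a lock might work, not something the slides specify. The names `try_lock` and `unlock` are hypothetical, a temporary local directory stands in for the shared file system, and the approach is only sound if the underlying file system makes exclusive create atomic.

```python
import os
import tempfile


def try_lock(lock_path: str) -> bool:
    """Try to take the lock by creating the file exclusively.

    O_CREAT | O_EXCL makes the call fail if the file already exists,
    so at most one caller succeeds until the file is removed.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def unlock(lock_path: str) -> None:
    """Release the lock by removing the file."""
    os.remove(lock_path)


workdir = tempfile.mkdtemp()                 # stand-in for a shared directory
lock_path = os.path.join(workdir, ".LOCK")

app1_has_lock = try_lock(lock_path)          # first caller takes the lock
app2_has_lock = try_lock(lock_path)          # second caller is refused
unlock(lock_path)                            # App1 releases
app2_retry = try_lock(lock_path)             # now App2 can take it

unlock(lock_path)                            # clean up
os.rmdir(workdir)
```

A consensus service such as ZooKeeper plays the same role without relying on file system semantics, at the cost of running a separate service.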
[Diagram: the primary parallel App (processes P1, P2, P3, …, Pn) with ∆FS; Mon and Viz, each with ∆FS, attach to it]
Link to ∆FS middleware and attach to the primary parallel app
Option 1: rely on job schedulers to automate namespace propagation
[Diagram: a job scheduler / workflow engine starts App_1 and App_2, each with input=…]
Option 2: ask external registries using search predicates
[Diagram: App_1 and App_2 interact with a snapshot registry; labeled actions: publish, collect, search (steps 1, 2, 3)]
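A registry queried with search predicates could look roughly like the sketch below. The slides do not describe the registry's interface, so `SnapshotRegistry`, its `publish` and `search` methods, and the attribute names are all hypothetical. Here a predicate is an ordinary function over a snapshot's published attributes.

```python
from typing import Any, Callable, Dict, List

Entry = Dict[str, Any]


class SnapshotRegistry:
    """Hypothetical in-memory registry of published snapshots."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def publish(self, name: str, **attrs: Any) -> None:
        """Record a snapshot name together with descriptive attributes."""
        self._entries.append({"name": name, **attrs})

    def search(self, predicate: Callable[[Entry], bool]) -> List[str]:
        """Return the names of all snapshots that satisfy the predicate."""
        return [e["name"] for e in self._entries if predicate(e)]


registry = SnapshotRegistry()
registry.publish("snap-climate-1", producer="App_1", topic="climate")
registry.publish("snap-movies-1", producer="App_1", topic="movies")
registry.publish("snap-climate-2", producer="App_2", topic="climate")

# App_2 picks its inputs by predicate instead of by fixed path.
climate_inputs = registry.search(lambda e: e.get("topic") == "climate")
```

An app could then build its namespace from the logs of whichever snapshots the search returns.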
Allows programmable namespace composition
[Diagram: namespace tree with /, climate, pacific, and atlantic; App1 and App2]
Won’t generate any conflicts
[Diagram: namespace / with user_profile, movie_profile, and login_log; Mapper and Reducer; Iter3 and Iter4]
Won’t generate any conflicts
[Diagram labels: job scheduler, workflow engine]
Won’t generate any conflicts
[Diagram: Parallel Scientific App; processes P1, P2, P3 connected by MPI; File]
What if there are conflicts?
Conflicts resolved per app’s own reconciliation policy
[Diagram: input snapshots each containing /deltafs with file_1 and file_2; a possible resolution outcome: /deltafs with file_1(a), file_1(b), file_2(a), file_2(b)]
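The resolution outcome in the diagram, where both versions of each conflicting file survive under suffixed names, can be sketched as a merge that takes the app's policy as a parameter. The data model (a path mapped to an object id) and the names `merge`, `keep_both`, and `prefer_first` are assumptions made for illustration only.

```python
from typing import Callable, Dict

Namespace = Dict[str, str]                      # path -> data object id
Policy = Callable[[str, str, str], Namespace]   # (path, id_a, id_b) -> entries


def keep_both(path: str, a: str, b: str) -> Namespace:
    """Policy matching the diagram: keep both versions under suffixed names."""
    return {f"{path}(a)": a, f"{path}(b)": b}


def prefer_first(path: str, a: str, b: str) -> Namespace:
    """Alternative policy: the first snapshot wins."""
    return {path: a}


def merge(snap_a: Namespace, snap_b: Namespace, policy: Policy) -> Namespace:
    """Merge two input snapshots, calling the app's policy on each conflict."""
    merged: Namespace = {}
    for path in sorted(set(snap_a) | set(snap_b)):
        if path in snap_a and path in snap_b and snap_a[path] != snap_b[path]:
            merged.update(policy(path, snap_a[path], snap_b[path]))
        else:
            merged[path] = snap_a.get(path, snap_b.get(path))
    return merged


snap_a = {"/deltafs/file_1": "obj-a1", "/deltafs/file_2": "obj-a2"}
snap_b = {"/deltafs/file_1": "obj-b1", "/deltafs/file_2": "obj-b2"}

both = merge(snap_a, snap_b, keep_both)
first = merge(snap_a, snap_b, prefer_first)
```

Since the policy is just an argument, two apps reading the same pair of snapshots can reconcile them differently without affecting each other.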
So no duplicated resolutions by different apps
[Diagram: apps and namespace curators, labeled as follows]
a namespace curator
a curator inherits a pre-resolved namespace from an app
another namespace curator
an app directly takes namespaces from 2 curators