BATCHFS
Scaling the File System Control Plane with Client-Funded Metadata Servers
Qing Zheng, Kai Ren, Garth Gibson Carnegie Mellon University 9th Parallel Data Storage Workshop/SC 2014 [vision-paper]
File System Architecture
[Figure: APPs (x8), OSDs (x10) in the Shared Object Storage Infrastructure, and the Metadata Service; path labels "I/O operations" and "metadata".]
PDSW14 Parallel Data Lab - http://www.pdl.cmu.edu/ 2
The data path is parallel, but the metadata path is not necessarily parallel.
[SC14, Tue, 2:30pm, Room 393-94-95]
Two orders of magnitude faster than Lustre/PVFS
[Figure: four Batch Clients, each labeled with "input", "chpfile" (x2), and "MPI".]
Batch apps are self-coordinated by MPI and workflow engines
[Figure: BatchFS on the Shared Underlying Storage Infrastructure, with four Batch APPs, each holding one batch of operations: (mknod, mkdir, chmod), (remove, mkdir, chmod), (mknod, chmod, mkdir), (chmod, mknod, mkdir).]
From per-op to per-batch synchronization
From server-side to mostly client-side processing
BatchFS is designed as an extension of IndexFS [SC14, Tue, 2:30pm, Room 393-94-95], inheriting its metadata representation to enable high-performance metadata processing.
(LSM Tree) [SSTable/LevelDB]
[(k,v), (k,v), ..., (k,v)]
mkdir
In-mem buffer
SSTable1 SSTable2 SSTable3 Key-Value Store
IndexFS Servers/Clients
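The layout on this slide (key-value pairs collected in an in-memory buffer, then written out as sorted SSTables) can be illustrated with a toy model. This is only a sketch and not IndexFS or LevelDB code: the class name `TinyLSM`, the `flush_threshold` parameter, and the use of path strings as keys are all assumptions made for illustration.

```python
class TinyLSM:
    """Toy LSM tree: an in-memory buffer flushed into immutable sorted tables."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self.buffer = {}      # in-memory buffer of recent writes
        self.sstables = []    # oldest first; each is a sorted list of (k, v)

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as one sorted, immutable table.
        if self.buffer:
            self.sstables.append(sorted(self.buffer.items()))
            self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        # Search newest table first so later writes shadow older ones.
        # A linear scan keeps the sketch short; real tables are indexed.
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


store = TinyLSM(flush_threshold=2)
store.put("/a", {"type": "dir", "mode": 0o755})    # mkdir /a
store.put("/a/b", {"type": "dir", "mode": 0o755})  # mkdir /a/b, triggers a flush
store.put("/a", {"type": "dir", "mode": 0o700})    # chmod /a, newer version
```

Because writes only append new tables, a metadata update never rewrites an existing table in place; the newest version of a key simply shadows older ones on lookup.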
File System Client IndexFS Server
traditional non-batched mkdir/chmod
SST1 SST2 SST3 SST4
Global Namespace server metadata storage
Shared Underlying Storage Infrastructure
File System Client, IndexFS Server
traditional non-batched mkdir/chmod
Global Namespace: SST1 SST2 SST3 SST4 (server metadata storage)
Local Lease-Protected Namespace: SST'1 SST'2 (localized/batched mkdir/chmod under a subtree)
bulk insertion
Shared Underlying Storage Infrastructure
A prototype of BatchFS as an IndexFS [SC14] feature: metadata bulk insertion (batching)
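The difference between the traditional per-operation path and bulk insertion can be sketched by counting server calls. The names below (`Server`, `BatchClient`, `rpc_count`, `commit`) are hypothetical and do not come from the IndexFS code base; the sketch only assumes that a client may buffer operations under one subtree and hand them to the server as a single sorted table.

```python
class Server:
    """Toy metadata server that counts how often clients call it."""

    def __init__(self):
        self.buffer = {}      # entries inserted one at a time
        self.sstables = []    # whole tables attached by bulk insertion
        self.rpc_count = 0

    def insert(self, key, value):
        # Traditional path: one call per metadata operation.
        self.rpc_count += 1
        self.buffer[key] = value

    def bulk_insert(self, sstable):
        # Batched path: one call attaches a whole client-built table.
        self.rpc_count += 1
        self.sstables.append(sstable)


class BatchClient:
    """Buffers operations under one subtree, then ships them as one table."""

    def __init__(self, subtree):
        self.subtree = subtree
        self.local = {}

    def mkdir(self, path, mode=0o755):
        if not path.startswith(self.subtree + "/"):
            raise ValueError("path is outside the client's subtree")
        self.local[path] = {"type": "dir", "mode": mode}

    def chmod(self, path, mode):
        self.local[path]["mode"] = mode

    def commit(self, server):
        server.bulk_insert(sorted(self.local.items()))
        self.local = {}


per_op, batched = Server(), Server()
paths = [f"/job/d{i}" for i in range(100)]
for p in paths:
    per_op.insert(p, {"type": "dir", "mode": 0o755})

client = BatchClient("/job")
for p in paths:
    client.mkdir(p)
client.chmod("/job/d0", 0o700)
client.commit(batched)
```

In this toy model the per-operation server sees 100 calls while the batched server sees one, which is the effect the slide attributes to batching.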
8+1 Node HDFS Cluster
8 Data Nodes and 1 Name Node. Each node has 2 CPUs, 8GB RAM, 1 SATA HDD, and one 1Gb Ethernet port.
[Figure: 8-node setup. HDFS Name Node; HDFS Data Nodes, each with a disk and 1-8 IndexFS clients; 1 IndexFS Server.]
[Figure: the same 8-node setup with IndexFS clients, now showing 2 IndexFS Servers.]
[Figure: the same 8-node setup with IndexFS clients, now showing 3 IndexFS Servers.]
[Figure: 8-node setup. HDFS Name Node; HDFS Data Nodes, each with a disk and 1-8 Batch clients; 1 IndexFS Server.]
[Chart: Throughput (K o…) vs. Total Number of Client Processes; y-axis ticks 50-250]

Configuration                  8      16     32     64
HDFS Baseline                  0.6    0.6    0.6    0.6
Single IndexFS Server          11     13     13     12
Dual IndexFS Servers           15     17     19     17
Full IndexFS Servers           18     22     29     34
Client-Side Bulk Insertion     139    188    203    216

Chart annotations: 360x, 8-18x, 8x
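The speedup annotations can be checked against the data labels with a little arithmetic. The sketch below assumes the twenty labels are listed in legend order, four per series, for 8, 16, 32, and 64 client processes; that reading of the chart is an assumption, and the helper name `speedup_at` is hypothetical.

```python
clients = [8, 16, 32, 64]
throughput = {
    "HDFS Baseline": [0.6, 0.6, 0.6, 0.6],
    "Single IndexFS Server": [11, 13, 13, 12],
    "Dual IndexFS Servers": [15, 17, 19, 17],
    "Full IndexFS Servers": [18, 22, 29, 34],
    "Client-Side Bulk Insertion": [139, 188, 203, 216],
}
bulk = throughput["Client-Side Bulk Insertion"]


def speedup_at(n_clients, baseline):
    """Ratio of bulk-insertion throughput to a baseline at one client count."""
    i = clients.index(n_clients)
    return bulk[i] / throughput[baseline][i]


for name in throughput:
    if name != "Client-Side Bulk Insertion":
        print(f"{name}: {speedup_at(64, name):.1f}x at 64 clients")
```

Under that reading, the ratio against the HDFS baseline at 64 clients comes out at about 360, and the ratio against a single IndexFS server at about 18, which is consistent with the 360x and 18x figures annotated on the chart.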
Lazy namespace synchronization
Pre-execute metadata ops at client-side
[Figure: a Batch Client calls snapshot(…), mkdir(…), chmod(…), and bulk_insert(…); SSTs are shown for the client-local namespace, the file system history, and the global namespace.]
Lazy semantics enforcement
Delayed until synchronization is eventually needed
[Figure: a Batch Client and Another Client issuing snapshot(…), mkdir(…), chmod(…), and bulk_insert(…); SSTs are shown for the client-local namespace, the file system history, and the global namespace.]
ill-formatted? permission violations? concurrent conflicts?
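The three questions raised on this slide (ill-formatted entries, permission violations, concurrent conflicts) suggest the kind of check that lazy enforcement defers until a client's batch is synchronized. The following is a minimal sketch of such a deferred check under invented rules: the function name `validate_batch`, the per-entry `owner` and `version` fields, and the owner-only permission rule are all assumptions, not the BatchFS design.

```python
def validate_batch(batch, global_ns, snapshot_version, user):
    """Return (path, reason) problems; an empty list means the batch passes."""
    problems = []
    for path, entry in batch:
        # Ill-formatted: paths must be absolute and entries must carry a mode.
        if not path.startswith("/") or "mode" not in entry:
            problems.append((path, "ill-formatted"))
            continue
        parent = path.rsplit("/", 1)[0] or "/"
        # Toy permission rule: only the owner of an existing parent may write.
        if parent in global_ns and global_ns[parent]["owner"] != user:
            problems.append((path, "permission violation"))
        # Conflict: the entry changed globally after the snapshot was taken.
        elif path in global_ns and global_ns[path]["version"] > snapshot_version:
            problems.append((path, "concurrent conflict"))
    return problems


global_ns = {
    "/": {"owner": "root", "mode": 0o755, "version": 1},
    "/proj": {"owner": "qing", "mode": 0o755, "version": 1},
    "/proj/log": {"owner": "qing", "mode": 0o755, "version": 5},
}
batch = [
    ("/proj/out", {"mode": 0o755}),  # fine
    ("/proj/log", {"mode": 0o700}),  # modified by someone else after snapshot
    ("/etc2", {"mode": 0o755}),      # parent "/" is owned by root
    ("bad", {}),                     # relative path, no mode
]
problems = validate_batch(batch, global_ns, snapshot_version=3, user="qing")
```

The client pays nothing for these checks while it runs; the cost is incurred once, at synchronization time, over the whole batch.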
Empty subtree
Exclusive access
Protected by server-issued leases
Lease expires

Empty subtree
Snapshot of a subtree
Concurrent access
Optimistic concurrency control
No timeout
Snapshot reads w/ access control
[PDSW14] [SC14]
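The lease-based approach listed above (exclusive access to a subtree, protected by a server-issued lease that expires) can be sketched as a small lease table. The class `LeaseTable`, its methods, and the explicit `now` argument are hypothetical; the sketch ignores nested subtrees and clock skew.

```python
class LeaseTable:
    """Toy server-side table granting one exclusive lease per subtree."""

    def __init__(self, duration):
        self.duration = duration
        self.leases = {}  # subtree -> (holder, expiry time)

    def acquire(self, subtree, holder, now):
        current = self.leases.get(subtree)
        if current is not None:
            owner, expiry = current
            # Refuse while another holder's lease is still valid.
            if now < expiry and owner != holder:
                return False
        self.leases[subtree] = (holder, now + self.duration)
        return True

    def holds(self, subtree, holder, now):
        owner, expiry = self.leases.get(subtree, (None, 0))
        return owner == holder and now < expiry


table = LeaseTable(duration=10)
granted_a = table.acquire("/job", "client-A", now=0)  # granted
granted_b = table.acquire("/job", "client-B", now=5)  # refused, A still holds
late_b = table.acquire("/job", "client-B", now=10)    # granted, A's lease expired
```

The expiry is what forces a timeout on the lease holder; the snapshot-based alternative on the slide lists "No timeout" because it resolves conflicts at merge time instead of excluding other clients up front.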
[Figure: Primary MDS and Private MDS. Namespace stages: Global Namespace, Snapshot Copy, Modified Namespace, Unchecked Namespace, Merged Namespace. Resource labels: Client Resources, Server Resources.]
[Figure: Auxiliary MDS, Primary MDS, and Private MDS. Namespace stages: Global Namespace, Snapshot Copy, Modified Namespace, Unchecked Namespace, Accepted Namespace, Merged Namespace. Resource labels: Client Resources, Server Resources.]
At least one RPC per operation
Inefficient metadata representation
Pessimistic concurrency control
Synchronous metadata interface
Dedicated authorization service
Fast Parallel Storage Infrastructure
Fixed Server Nodes: Primary MDS (x3)
Client-Provisioned Metadata Computing Nodes: Auxiliary MDS (x4), Private MDS (x4)
BatchFS scales with the number of client nodes.
Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW14)
Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC14)
ACL-specific SSTables
[Figure: seven Batch Clients and a Primary MDS over the Underlying Parallel File System; labels "// Access Ctrl" and "// Quota Mng"; SSTables SST1-qing, SST1-kai, SST1-garth.]