

SLIDE 1

BATCHFS

Scaling the File System Control Plane with Client-Funded Metadata Servers

Qing Zheng, Kai Ren, Garth Gibson Carnegie Mellon University 9th Parallel Data Storage Workshop/SC 2014 [vision-paper]

SLIDE 2

File System Architecture

[Figure: many APPs perform parallel I/O operations against OSDs in a Shared Object Storage Infrastructure, while their metadata operations all go to a single Metadata Service]
PDSW14 Parallel Data Lab - http://www.pdl.cmu.edu/ 2

The data path is parallel, but the metadata path is not necessarily.

SLIDE 3

Reality

Data scales; METADATA is hard to scale, especially in HPC data centers
Programmers like POSIX SEMANTICS, which limit linear scalability

SLIDE 4

How We Scale the Metadata


IndexFS [SC14, Tue, 2:30pm, Room 393-94-95]: two orders of magnitude faster than Lustre/PVFS

BATCHFS: scale another order of magnitude

SLIDE 5

BATCH

APPLICATION

SLIDE 6

Batch Applications

[Figure: several batch application instances, each a Batch Client that reads input and writes checkpoint files (chpfile) as output, coordinated with the others via MPI]

Batch apps are self-coordinated by MPI and workflow engines

SLIDE 7

Key Observation

Batch apps DON'T need the FS to communicate
SYNCHRONOUS and SERIALIZED metadata management is OVERKILL for batch apps

SLIDE 8

Introducing BatchFS

Deep batching for high throughput

[Figure: four Batch APPs, each submitting a batch of metadata operations (mknod, mkdir, chmod, remove) to BatchFS atop the Shared Underlying Storage Infrastructure]

SLIDE 9

BatchFS Philosophy

From per-op to per-batch synchronization
From server-side to mostly client-side processing

A CLIENT-FUNDED metadata architecture

SLIDE 10

BATCHFS

BACKGROUND

SLIDE 11

Background

BatchFS is designed as an extension of IndexFS

[SC14, Tue, 2:30pm, Room 393-94-95]

inheriting its metadata representation to enable high-performance metadata processing



SLIDE 12

Metadata Representation

A log-structured and indexed data structure (LSM-tree) [SSTable/LevelDB]

[Figure: a mkdir produces (k,v) records in an in-memory buffer, which is flushed into SSTable1/SSTable2/SSTable3 of a key-value store shared by IndexFS servers/clients]
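The LSM write path on this slide can be sketched in a few lines of Python. This is a toy illustration of the idea, not IndexFS or LevelDB code, and all names are hypothetical: mutations land in an in-memory buffer, full buffers are frozen into immutable sorted runs ("SSTables"), and reads consult the buffer first, then the runs from newest to oldest.

```python
# Toy LSM-tree write path: in-memory buffer + immutable sorted runs.
# Illustrative only; IndexFS uses LevelDB-style SSTables.

class ToyLSM:
    def __init__(self, buffer_limit=2):
        self.buffer = {}        # in-memory write buffer
        self.sstables = []      # flushed sorted runs, oldest first
        self.buffer_limit = buffer_limit

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # Freeze the buffer as a sorted, immutable run (an "SSTable").
        self.sstables.append(sorted(self.buffer.items()))
        self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        # The newest value wins, so scan runs from most recent backwards.
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

lsm = ToyLSM()
lsm.put("/a", "dir")
lsm.put("/a/b", "file")   # buffer_limit reached: triggers a flush
lsm.put("/a", "dir+x")    # newer value shadows the flushed one
```

Because runs are immutable and sorted, a client can build them locally and ship them whole, which is what makes the bulk insertion on the following slides cheap.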

SLIDE 13

Client-Server Interaction

[Figure: a File System Client issues traditional, non-batched mkdir/chmod calls to an IndexFS Server, which stores the Global Namespace as SST1-SST4 in server metadata storage on the Shared Underlying Storage Infrastructure]

SLIDE 14

Metadata Bulk Insertion

[Figure: in addition to traditional, non-batched mkdir/chmod against the Global Namespace (SST1-SST4), a File System Client runs localized/batched mkdir/chmod under a subtree in a Local Lease-Protected Namespace (SST'1, SST'2), then merges it into server metadata storage via bulk insertion, all atop the Shared Underlying Storage Infrastructure]
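The bulk-insertion idea can be sketched as follows (hypothetical names, not the IndexFS API): the client pre-executes its metadata operations in a client-local namespace, packs the result into one sorted run, and the server merges whole runs instead of replaying each operation over RPC.

```python
# Sketch of metadata bulk insertion. Hypothetical names throughout;
# a real client would emit SSTable files, not Python lists.

def client_batch(ops):
    """Apply metadata ops to a client-local namespace; return one sorted run."""
    local = {}
    for op, path, arg in ops:
        if op == "mkdir":
            local[path] = {"type": "dir", "mode": arg}
        elif op == "chmod":
            local[path]["mode"] = arg
    return sorted(local.items())   # one SSTable-like run

def server_bulk_insert(namespace, run):
    """Merge a client-built run into the global namespace in one step."""
    for path, attrs in run:
        namespace[path] = attrs    # no per-op RPC, no per-op replay

global_ns = {}
run = client_batch([
    ("mkdir", "/job1", 0o755),
    ("mkdir", "/job1/out", 0o755),
    ("chmod", "/job1/out", 0o700),
])
server_bulk_insert(global_ns, run)
```

Note that the chmod never reaches the server as an operation: the batch collapses three client-side ops into two final records, which is where the "deep batching" savings come from.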

SLIDE 15

Preliminary Results

A prototype of BatchFS built as an IndexFS [SC14] feature: metadata bulk insertion (batching)

8+1 Node HDFS Cluster

8 Data Nodes and 1 Name Node; each node has 2 CPUs, 8 GB RAM, one SATA HDD, and one 1 Gb Ethernet port

SLIDE 16

Workload

Each client process creates 1 private directory (8-64 client processes on a fixed set of 8 nodes)
Clients insert empty files into their own directories (1 million × #servers files in total)


SLIDE 17

Experiment Setup

[Figure: 8 nodes; each runs an HDFS Data Node, 1-8 IndexFS clients, and a local disk; one node also hosts the HDFS Name Node and a single IndexFS Server]

SLIDE 18

Experiment Setup

[Figure: the same 8-node setup, now with a second IndexFS Server added (dual servers)]

SLIDE 19

Experiment Setup

[Figure: the same 8-node setup with further IndexFS Servers added (full servers)]

SLIDE 20

Experiment Setup

[Figure: the same 8-node setup, with each node running 1-8 Batch clients instead of IndexFS clients]

SLIDE 21

8x-360x Perf. Improvements

Throughput (K op/s) by total number of client processes:

Client processes              8     16     32     64
HDFS Baseline               0.6    0.6    0.6    0.6
Single IndexFS Server        11     13     13     12
Dual IndexFS Servers         15     17     19     17
Full IndexFS Servers         18     22     29     34
Client-Side Bulk Insertion  139    188    203    216

(Chart annotations: 360x over the HDFS baseline; 8-18x over IndexFS servers.)
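The headline speedups can be checked against the plotted throughputs. The sketch below is one plausible reading; the slide does not spell out exactly which pairs of bars each annotation compares.

```python
# Reproducing the slide's speedup annotations from the measured
# throughputs (K op/s); the pairing of bars is our assumption.
baseline_64 = 0.6     # HDFS baseline at 64 client processes
bulk_64 = 216         # client-side bulk insertion at 64 processes
full_8 = 18           # full IndexFS servers at 8 processes
bulk_8 = 139          # client-side bulk insertion at 8 processes

speedup_vs_hdfs = round(bulk_64 / baseline_64)   # vs. HDFS baseline
speedup_vs_full = round(bulk_8 / full_8)         # vs. full IndexFS servers
print(speedup_vs_hdfs, speedup_vs_full)
```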

SLIDE 22

BATCHFS

DESIGN

SLIDE 23

Deep Metadata Batch

Lazy namespace synchronization

Pre-execute metadata ops at client-side

[Figure: a Batch Client calls snapshot(…) on the global namespace (SSTs of file system history), pre-executes mkdir(…)/chmod(…) in a client-local namespace of SSTs, and later calls bulk_insert(…) to merge the result back into the global namespace]

SLIDE 24

Deep Metadata Batch

Lazy namespace synchronization

Pre-execute metadata ops at client-side

Lazy semantics enforcement

Delayed until synchronization is eventually needed

[Figure: as before, a Batch Client calls snapshot(…), pre-executes mkdir(…)/chmod(…) locally, and bulk_insert(…)s the result; meanwhile Another Client appends to the file system history concurrently, raising the deferred questions: ill-formatted? permission violations? concurrent conflicts?]
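The deferred checks might look like the following sketch. All names and the version-based conflict test are hypothetical illustrations of optimistic concurrency control, not the BatchFS implementation, and permission checks are omitted for brevity.

```python
# Deferred verification at merge time: validate a client's batched
# mutations only when the batch is synchronized back (optimistic),
# instead of locking or checking every operation synchronously.

def verify_and_merge(global_ns, snapshot_version, current_version, run):
    # Concurrent conflict: the global namespace moved past our snapshot
    # and the batch touches paths that now exist -> reject the batch.
    if snapshot_version != current_version:
        conflicts = [path for path, _ in run if path in global_ns]
        if conflicts:
            return False, conflicts
    # Well-formedness checks, also deferred until merge time.
    for path, _ in run:
        if not path.startswith("/"):
            return False, [path]      # ill-formatted name
    global_ns.update(dict(run))
    return True, []

ns = {"/etc": {"type": "dir"}}
ok1, bad1 = verify_and_merge(ns, snapshot_version=7, current_version=7,
                             run=[("/job/out", {"type": "file"})])
ok2, bad2 = verify_and_merge(ns, snapshot_version=7, current_version=8,
                             run=[("/etc", {"type": "file"})])
```

A clean batch merges in one step; a conflicting batch is rejected wholesale and left to the conflict-resolution policies discussed under Future Work.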

SLIDE 25

Subtree delegation in IndexFS [SC14] vs. snapshots in BatchFS [PDSW14]:

[SC14]: empty subtree · exclusive access · protected by server-issued leases · lease expires
[PDSW14]: snapshot of a subtree · concurrent access · optimistic concurrency control · no timeout · snapshot reads w/ access control

SLIDE 26

Client-Funded Metadata Processing

[Figure: a Private MDS on Client Resources takes a Snapshot Copy of the Global Namespace from the Primary MDS on Server Resources, produces a Modified Namespace, and submits it as an Unchecked Namespace, which the Primary MDS turns into the Merged Namespace]

SLIDE 27

Client-Funded Metadata Verification

[Figure: as above, but an Auxiliary MDS on Client Resources verifies the Unchecked Namespace into an Accepted Namespace before the Primary MDS produces the Merged Namespace]

SLIDE 28

FUTURE

WORK

SLIDE 29

Conflict Resolution

Who is responsible? What are the semantics?
A) DB-like: read/write sets, transactional
B) Bayou-like: auto resolution, domain rules
C) Coda-like: resolved by humans

SLIDE 30

Self-Provable Metadata

For clients to generate proofs of the correctness of their namespace mutations
A) operation log (possibly compressed)
B) logic-based proof (proof-carrying code)
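Option (A) can be sketched as a toy replay-based verifier (hypothetical names, not a proposed implementation): the verifier replays the client's operation log against the snapshot and accepts the mutation only if it reproduces exactly the claimed result.

```python
# Toy verifier for option (A): replay the client's operation log and
# compare against the namespace the client claims to have produced.

def replay(snapshot, op_log):
    ns = dict(snapshot)
    for op, path in op_log:
        if op == "mknod":
            ns[path] = "file"
        elif op == "remove":
            ns.pop(path, None)
    return ns

def verify(snapshot, op_log, claimed_result):
    return replay(snapshot, op_log) == claimed_result

snap = {"/a": "dir"}
log = [("mknod", "/a/x"), ("mknod", "/a/y"), ("remove", "/a/x")]
honest = {"/a": "dir", "/a/y": "file"}                     # what the log yields
forged = {"/a": "dir", "/a/y": "file", "/a/z": "file"}     # extra, unlogged file
```

Replay makes the log self-checking: a client cannot smuggle in a mutation its log does not justify, at the cost of the verifier re-executing (or checking a compressed form of) the log.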


SLIDE 31

Conclusion

• At least one RPC per operation
• Inefficient metadata representation
• Pessimistic concurrency control
• Synchronous metadata interface
• Dedicated authorization service


BATCHFS

SLIDE 32

BatchFS Architecture

[Figure: Primary MDSs run on Fixed Server Nodes; Auxiliary MDSs and Private MDSs run on Client-Provisioned Metadata Computing Nodes; all share a Fast Parallel Storage Infrastructure]

BatchFS scales with the number of client nodes.

SLIDE 33

References

Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW14)
Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC14)


BATCHFS

SLIDE 34

QUESTIONS

IDEAS AND FEEDBACK



SLIDE 35

BACKUP SLIDES

SLIDE 36

Access & Quota Control

Access enforced by OSD
No quota control for metadata
Quota control on data provided by OSD

SLIDE 37

Snapshot Access Control

[Figure: Batch Clients and the Primary MDS sit atop the Underlying Parallel File System, which provides access control and quota management; mutations are stored in ACL-specific SSTables, e.g. SST1-qing, SST1-kai, SST1-garth]

SLIDE 38

Trustworthiness

How to trust auxiliary metadata servers?
A) just trust these servers
B) trusted VMs running the right code
C) hardware co-processors such as TPM