

SLIDE 1: Structuring PLFS for Extensibility

Chuck Cranor, Milo Polte, Garth Gibson

PARALLEL DATA LABORATORY

Carnegie Mellon University

SLIDE 2: What is PLFS?
  • Parallel Log Structured File System

– Interposed filesystem b/w apps & backing storage
– Los Alamos National Labs, CMU, EMC, …
– Target: HPC checkpoint files

  • PLFS transparently transforms a highly concurrent write access pattern to a pattern more efficient for distributed filesystems

– First paper: Bent et al., Supercomputing 2009
– http://github.com/plfs, http://institute.lanl.gov/plfs/


SLIDE 3: Checkpoint Write Patterns
  • The two main checkpoint write patterns:

– N-1: all N processes write to one shared file (see the MPI-IO sketch after this slide)

  • Concurrent I/O to a single file is often unscalable
  • Small, unaligned, clustered traffic is problematic

– N-N: each process writes to its own file

  • Overhead of inserting many files in a single dir
  • Easier for DFS (after files created)
  • Archival and management more difficult
  • Initial PLFS focus: improve N-1 case

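To make the N-1 pattern concrete, here is a minimal MPI-IO sketch of a strided N-1 checkpoint write (the file name and block count are illustrative; 47001 is one of the access sizes used later in the deck):

    // n1_strided.cpp: N ranks write interleaved blocks into one shared file,
    // producing the small, unaligned, clustered traffic described above.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int blocksize = 47001;            // deliberately unaligned
        const int nblocks = 4;                  // strides per rank (illustrative)
        std::vector<char> buf(blocksize, 'x');  // this rank's checkpoint data

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "ckpt.n1",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        for (int b = 0; b < nblocks; b++) {
            // stride b holds one block from every rank, in rank order
            MPI_Offset off = ((MPI_Offset)b * nprocs + rank) * blocksize;
            MPI_File_write_at(fh, off, buf.data(), blocksize, MPI_CHAR,
                              MPI_STATUS_IGNORE);
        }
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Every rank seeks all over the shared file; this is the case the original PLFS work targets.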

SLIDE 4: PLFS Transforms Workloads
  • PLFS improves N-1 performance by transforming it into an N-N workload

  • FUSE/MPI: transparent solution, no application changes required


SLIDE 5: PLFS Converts N-1 to N-N

[Figure: the PLFS virtual layer presents one logical file /foo to processes on host1, host2, and host3; physically, /foo/ is a container directory on the underlying parallel file system with per-host subdirectories hostdir.1/, hostdir.2/, hostdir.3/, each holding per-process data and index logs (data.131/indx.131, data.132/indx.132, data.279/indx.279, data.281/indx.281, data.152/indx.152, data.148/indx.148). The write path is sketched in code below.]

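As a rough sketch of what this figure shows (the function and record layout are invented for illustration; the real PLFS container code differs):

    // Sketch: a PLFS-style container turns scattered N-1 writes into
    // per-process sequential appends (N-N) plus an index.
    #include <cstdio>
    #include <cstdint>

    struct IndexEntry {            // hypothetical index record
        int64_t logical_off;       // where the bytes belong in the logical file
        int64_t length;            // how many bytes
        int64_t physical_off;      // where they landed in this writer's data log
    };

    // Whatever the logical offset, the backing store sees only appends.
    void container_write(FILE *datalog, FILE *indexlog,
                         const void *buf, int64_t len, int64_t logical_off) {
        int64_t physical_off = ftell(datalog);   // current end of the data log
        fwrite(buf, 1, (size_t)len, datalog);    // sequential append
        IndexEntry e = { logical_off, len, physical_off };
        fwrite(&e, sizeof(e), 1, indexlog);      // record the mapping
    }

Reads reverse the mapping: PLFS consults the index logs from every hostdir to find which data log holds each byte range.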

SLIDE 6: PLFS N-1 Bandwidth Speedups

[Figure: chart of PLFS N-1 bandwidth speedups; y-axis labeled SPEED UP, with 10X and 100X marks.]

SLIDE 7: The Price of Success
  • Original PLFS was limited to 1 workload:

– N-1 checkpoint on mounted POSIX filesystem
– All data stored in PLFS container logs

  • Ported first to MPI-IO/ROMIO

– Makes it feasible to deploy on leadership-class machines

  • Success with LANL apps: actual adoption?

– Requires maintainability & roadmap evolution
– Develop a team: LANL, EMC, CMU, …

  • Revisit code with maintainability in mind


SLIDE 8: PLFS Extensibility Architecture

[Figure: layered architecture of libplfs. An HPC application calls the PLFS high-level API; the Logical FS interface below it selects a workload transformation (container, small file, flat file); the Index API offers byte-range, pattern, and distributed (MDHIM w/LevelDB) indexing; the I/O Store interface selects a backend (posix, pvfs, iofsl, or hdfs via libhdfs/jvm and hdfs.jar). The Logical FS extension point is sketched below.]
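In rough C++ terms, the Logical FS extension point in this figure is an abstract interface that each workload transformation implements; a minimal sketch with invented names (the actual PLFS classes differ):

    // Sketch: the Logical FS interface decides how logical files map onto
    // backend objects below it.
    #include <sys/types.h>
    #include <cstddef>

    struct LogicalFS {
        virtual ~LogicalFS() {}
        virtual int Create(const char *logical_path, mode_t mode) = 0;
        virtual ssize_t Write(const char *logical_path, const void *buf,
                              size_t count, off_t offset) = 0;
        virtual ssize_t Read(const char *logical_path, void *buf,
                             size_t count, off_t offset) = 0;
    };

    // Implementations named in the figure (sketch):
    //   container  - N-1 checkpoint logs plus a byte-range or pattern index
    //   small file - packs many small files into shared logs
    //   flat file  - passes files through to the backend one-to-one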

SLIDE 9: Case Study: HPC in the Cloud
  • Emergence of Hadoop: converged storage
  • HDFS: Hadoop Distributed Filesystem

– Key attributes:

  • Single sequential writer (not POSIX, no pwrite)
  • Not VFS mounted, access through Java API
  • Local storage on nodes (converged)
  • Data replicated ~3 times (local+remote1+remote2)
  • HPC in the Cloud: N-1 checkpoint on HDFS?

– Observation: PLFS log I/O fits HDFS semantics (see the libhdfs sketch below)

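For reference, writing through libhdfs (the C binding that forwards to the Java API via an embedded JVM) looks like this; the path is illustrative, and note there is no pwrite-style call, only sequential append:

    // hdfs_append.cpp: HDFS's single-sequential-writer model via libhdfs.
    #include <hdfs.h>       // ships with the Hadoop distribution
    #include <fcntl.h>
    #include <cstring>

    int main() {
        hdfsFS fs = hdfsConnect("default", 0);    // contact the namenode
        hdfsFile f = hdfsOpenFile(fs, "/ckpt/data.131", O_WRONLY, 0, 0, 0);
        const char *buf = "log bytes";
        // every write lands at the current end of file: append-only
        hdfsWrite(fs, f, buf, (tSize)strlen(buf));
        hdfsCloseFile(fs, f);
        hdfsDisconnect(fs);
        return 0;
    }

Append-only is a poor fit for arbitrary N-1 offsets but a natural fit for PLFS's log appends.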

SLIDE 10: PLFS Backend Limitations
  • PLFS hardwired to POSIX API:

– Needs a kernel mounted filesystem
– Uses integer file descriptors
– Memory maps index files to read them

  • HDFS does not fit these assumptions
  • Solution: I/O Store

– Insert a layer of indirection above the PLFS backend (sketched below)
– Model it after the POSIX API to minimize code changes

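A minimal sketch of that indirection, with invented names (the shipped PLFS classes are richer): backends hand out opaque handles in place of integer file descriptors, and index files can be read into memory instead of memory mapped.

    // Sketch: I/O Store interface, modeled on the POSIX calls PLFS already
    // makes so that existing code needs few changes.
    #include <sys/types.h>
    #include <cstddef>

    class IOSHandle {                     // stands in for the integer fd
    public:
        virtual ~IOSHandle() {}
        virtual ssize_t Pread(void *buf, size_t count, off_t offset) = 0;
        virtual ssize_t Write(const void *buf, size_t count) = 0;
        virtual int Close() = 0;
    };

    class IOStore {                       // one per backend: posix, hdfs, ...
    public:
        virtual ~IOStore() {}
        virtual IOSHandle *Open(const char *path, int flags, int &err) = 0;
        virtual int Unlink(const char *path) = 0;
        virtual int Mkdir(const char *path, mode_t mode) = 0;
    };

A posix store forwards each method to libc; an HDFS store tracks the file offset itself and replaces the old mmap of index files with a plain read.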

SLIDE 11: PLFS I/O Store Architecture

[Figure: PLFS FUSE (via the libc API) and PLFS MPI I/O both sit on libplfs; inside libplfs, the PLFS container layer calls the I/O store interface, behind which a posix I/O store talks to a kernel-mounted fs and an HDFS I/O store talks to Java code through lib{hdfs,jvm} and hdfs.jar.]

SLIDE 12: PLFS/HDFS Benchmark
  • Testbed: PRObE (www.nmc-probe.org)
  • Each node has dual 1.6GHz AMD cores, 16GB RAM, 1TB drive, gigabit ethernet

  • Ubuntu Linux, HDFS 0.21.0, PLFS, OpenMPI
  • Benchmark: LANL FS Test Suite (fs_test)
  • Simulates N-1 checkpoint, strided
  • Filesystems tested:

– PVFS OrangeFS 2.8.4 w/64MB stripe size
– PLFS/HDFS w/1 replica (local disk)
– PLFS/HDFS w/3 replicas (local disk + remote1 + remote2)

  • Blocksizes: 47001, 48K, 1M
  • Checkpoint size: 32GB written by 64 nodes


SLIDE 13: Benchmark Operation

[Figure: in the write phase, each node writes one block per stride into the shared file, with remaining strides continuing the pattern; in the read phase the node assignment is shifted so each node reads blocks written by another.]

We unmount and flush the caches of the data filesystem between the write and read phases.
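
In the same spirit as the sketch after slide 3, the access pattern and the read-phase shift reduce to a few lines (helper names invented; the exact shift amount is an assumption, the slide only says nodes are shifted):

    // Offset of a rank's block within stride s of the shared checkpoint file.
    #include <cstdint>

    int64_t n1_offset(int64_t s, int rank, int nprocs, int64_t blocksize) {
        return (s * nprocs + rank) * blocksize;   // ranks interleave per stride
    }

    // Read phase: shift ranks (here by one) so no node rereads what it wrote.
    int read_rank(int rank, int nprocs) { return (rank + 1) % nprocs; }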

SLIDE 14: PLFS Implementation Architecture

  • FUSE filesystem and a middleware lib (MPI)

[Figure: two paths into the PLFS lib. On the FUSE path, app processes issue I/O through the VFS/POSIX API; the kernel FUSE module upcalls to the user-level PLFS FUSE daemon, which holds the PLFS lib. On the MPI path, app processes link the PLFS/MPI libs directly. Both paths do backing-store I/O to a local fs (disk) or a distributed fs (network); MPI sync calls cross the interconnect to other nodes. A minimal FUSE hookup is sketched below.]
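
The FUSE path amounts to registering callbacks with the kernel's FUSE module; a minimal sketch with an invented handler that would forward to libplfs:

    // Sketch: skeleton of a PLFS-style FUSE daemon (handler name invented).
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <cstring>

    static int plfs_fuse_write(const char *path, const char *buf, size_t count,
                               off_t offset, struct fuse_file_info *fi) {
        // real code would hand (path, buf, count, offset) to libplfs, which
        // appends to container logs on the backing store
        (void)path; (void)buf; (void)offset; (void)fi;
        return (int)count;                    // report all bytes written
    }

    int main(int argc, char *argv[]) {
        struct fuse_operations ops;
        std::memset(&ops, 0, sizeof(ops));
        ops.write = plfs_fuse_write;          // called on each VFS write upcall
        return fuse_main(argc, argv, &ops, NULL);
    }

MPI applications skip the kernel round trip entirely: they link the PLFS/MPI libs into the process, as the figure's MPI path shows.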

SLIDE 15: PLFS/HDFS Write Bandwidth

[Figure: write bandwidth (Mbytes/s, 0-2000) vs access unit size (47001, 48K, 1M bytes) for PVFS-write, PLFS/HDFS1-write, and PLFS/HDFS3-write.]

SLIDE 16: PLFS/HDFS Write Bandwidth

[Figure: same write bandwidth chart as slide 15.]

  • PLFS/HDFS performs well (note HDFS1 is local disk)

SLIDE 17: PLFS/HDFS Write Bandwidth

[Figure: same write bandwidth chart as slides 15 and 16.]

  • PLFS/HDFS performs well (note HDFS3 is 3 copies)

SLIDE 18: PLFS/HDFS Read Bandwidth

[Figure: read bandwidth (Mbytes/s, 0-1000) vs access unit size (47001, 48K, 1M bytes) for PVFS-read, PLFS/HDFS1-read, and PLFS/HDFS3-read.]

  • HDFS with small access size benefits from PLFS log grouping

SLIDE 19: PLFS/HDFS Read Bandwidth

[Figure: same read bandwidth chart as slide 18.]

  • HDFS3 with large access size suffers imbalance

SLIDE 20: HDFS 1 vs 3: I/O Scheduling

[Figure: total size of data served per node (MB, 0-1000) across node numbers 10-60, for PLFS/HDFS1 vs PLFS/HDFS3.]

  • Network counters show HDFS3 read imbalance

SLIDE 21: I/O Store Status
  • Rewrote initial I/O Store prototype

– Production-level code
– Multiple concurrent instances of I/O Stores

  • Re-plumbed entire backend I/O path
  • Prototyped POSIX, HDFS, PVFS stores

– IOFSL done by EMC

  • Regression tested at LANL
  • I/O Store is now part of the released PLFS code

– https://github.com/PLFS


SLIDE 22: Conclusions
  • PLFS extensions for workload transformation:

– Logical FS interface

  • Not just container logs; packing small files, burst buffer

– I/O Store layer

  • Non-POSIX backends (HDFS, IOFSL, PVFS)
  • Compression, write buffering, IO forwarding

– Container index extensions

  • PLFS is open source, available on github

– http://github.com/plfs
– Developer email: plfs-devel@lists.sourceforge.net
