

SLIDE 1: Structuring PLFS for Extensibility

Chuck Cranor, Milo Polte, Garth Gibson

PARALLEL DATA LABORATORY

Carnegie Mellon University

SLIDE 2: What is PLFS?
  • Parallel Log Structured File System

– Interposed filesystem b/w apps & backing storage
– Los Alamos National Labs, CMU, EMC, …
– Target: HPC checkpoint files

  • PLFS transparently transforms a highly concurrent write access pattern to a pattern more efficient for distributed filesystems

– First paper: Bent et al., Supercomputing 2009
– http://github.com/plfs, http://institute.lanl.gov/plfs/


SLIDE 3: Checkpoint Write Patterns
  • The two main checkpoint write patterns:

– N-1: all N processes write to one shared file (see the MPI-IO sketch after this slide)

  • Concurrent I/O to a single file is often unscalable
  • Small, unaligned, clustered traffic is problematic

– N-N: each process writes to its own file

  • Overhead of inserting many files in a single dir
  • Easier for DFS (after files created)
  • Archival and management more difficult
  • Initial PLFS focus: improve N-1 case

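To make the N-1 pattern concrete, here is a minimal MPI-IO sketch of a strided N-1 checkpoint write (the file name and block count are illustrative; 47001 is one of the access sizes used later in the deck):

    // n1_strided.cpp: N ranks write interleaved blocks into one shared file,
    // producing the small, unaligned, clustered traffic described above.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int blocksize = 47001;            // deliberately unaligned
        const int nblocks = 4;                  // strides per rank (illustrative)
        std::vector<char> buf(blocksize, 'x');  // this rank's checkpoint data

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "ckpt.n1",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        for (int b = 0; b < nblocks; b++) {
            // stride b holds one block from every rank, in rank order
            MPI_Offset off = ((MPI_Offset)b * nprocs + rank) * blocksize;
            MPI_File_write_at(fh, off, buf.data(), blocksize, MPI_CHAR,
                              MPI_STATUS_IGNORE);
        }
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Every rank seeks all over the shared file; this is the case the original PLFS work targets.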

SLIDE 4: PLFS Transforms Workloads
  • PLFS improves N-1 performance by transforming it into an N-N workload

  • FUSE/MPI: transparent solution, no application changes required


SLIDE 5: PLFS Converts N-1 to N-N

[Figure: the PLFS virtual layer presents one logical file /foo to processes on host1, host2, and host3; physically, /foo/ is a container directory on the underlying parallel file system with per-host subdirectories hostdir.1/, hostdir.2/, hostdir.3/, each holding per-process data and index logs (data.131/indx.131, data.132/indx.132, data.279/indx.279, data.281/indx.281, data.152/indx.152, data.148/indx.148). The write path is sketched in code below.]

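As a rough sketch of what this figure shows (the function and record layout are invented for illustration; the real PLFS container code differs):

    // Sketch: a PLFS-style container turns scattered N-1 writes into
    // per-process sequential appends (N-N) plus an index.
    #include <cstdio>
    #include <cstdint>

    struct IndexEntry {            // hypothetical index record
        int64_t logical_off;       // where the bytes belong in the logical file
        int64_t length;            // how many bytes
        int64_t physical_off;      // where they landed in this writer's data log
    };

    // Whatever the logical offset, the backing store sees only appends.
    void container_write(FILE *datalog, FILE *indexlog,
                         const void *buf, int64_t len, int64_t logical_off) {
        int64_t physical_off = ftell(datalog);   // current end of the data log
        fwrite(buf, 1, (size_t)len, datalog);    // sequential append
        IndexEntry e = { logical_off, len, physical_off };
        fwrite(&e, sizeof(e), 1, indexlog);      // record the mapping
    }

Reads reverse the mapping: PLFS consults the index logs from every hostdir to find which data log holds each byte range.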

SLIDE 6: PLFS N-1 Bandwidth Speedups

[Figure: chart of PLFS N-1 bandwidth speedups; y-axis labeled SPEED UP, with 10X and 100X marks.]

SLIDE 7: The Price of Success
  • Original PLFS was limited to 1 workload:

– N-1 checkpoint on mounted POSIX filesystem
– All data stored in PLFS container logs

  • Ported first to MPI-IO/ROMIO

– Makes it feasible to deploy on leadership-class machines

  • Success with LANL apps: actual adoption?

– Requires maintainability & roadmap evolution
– Develop a team: LANL, EMC, CMU, …

  • Revisit code with maintainability in mind


SLIDE 8: PLFS Extensibility Architecture

[Figure: layered architecture of libplfs. An HPC application calls the PLFS high-level API; the Logical FS interface below it selects a workload transformation (container, small file, flat file); the Index API offers byte-range, pattern, and distributed (MDHIM w/LevelDB) indexing; the I/O Store interface selects a backend (posix, pvfs, iofsl, or hdfs via libhdfs/jvm and hdfs.jar). The Logical FS extension point is sketched below.]
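In rough C++ terms, the Logical FS extension point in this figure is an abstract interface that each workload transformation implements; a minimal sketch with invented names (the actual PLFS classes differ):

    // Sketch: the Logical FS interface decides how logical files map onto
    // backend objects below it.
    #include <sys/types.h>
    #include <cstddef>

    struct LogicalFS {
        virtual ~LogicalFS() {}
        virtual int Create(const char *logical_path, mode_t mode) = 0;
        virtual ssize_t Write(const char *logical_path, const void *buf,
                              size_t count, off_t offset) = 0;
        virtual ssize_t Read(const char *logical_path, void *buf,
                             size_t count, off_t offset) = 0;
    };

    // Implementations named in the figure (sketch):
    //   container  - N-1 checkpoint logs plus a byte-range or pattern index
    //   small file - packs many small files into shared logs
    //   flat file  - passes files through to the backend one-to-one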

SLIDE 9: Case Study: HPC in the Cloud
  • Emergence of Hadoop: converged storage
  • HDFS: Hadoop Distributed Filesystem

– Key attributes:

  • Single sequential writer (not POSIX, no pwrite)
  • Not VFS mounted, access through Java API
  • Local storage on nodes (converged)
  • Data replicated ~3 times (local+remote1+remote2)
  • HPC in the Cloud: N-1 checkpoint on HDFS?

– Observation: PLFS log I/O fits HDFS semantics (see the libhdfs sketch below)

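For reference, writing through libhdfs (the C binding that forwards to the Java API via an embedded JVM) looks like this; the path is illustrative, and note there is no pwrite-style call, only sequential append:

    // hdfs_append.cpp: HDFS's single-sequential-writer model via libhdfs.
    #include <hdfs.h>       // ships with the Hadoop distribution
    #include <fcntl.h>
    #include <cstring>

    int main() {
        hdfsFS fs = hdfsConnect("default", 0);    // contact the namenode
        hdfsFile f = hdfsOpenFile(fs, "/ckpt/data.131", O_WRONLY, 0, 0, 0);
        const char *buf = "log bytes";
        // every write lands at the current end of file: append-only
        hdfsWrite(fs, f, buf, (tSize)strlen(buf));
        hdfsCloseFile(fs, f);
        hdfsDisconnect(fs);
        return 0;
    }

Append-only is a poor fit for arbitrary N-1 offsets but a natural fit for PLFS's log appends.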

SLIDE 10: PLFS Backend Limitations
  • PLFS hardwired to POSIX API:

– Needs a kernel mounted filesystem
– Uses integer file descriptors
– Memory maps index files to read them

  • HDFS does not fit these assumptions
  • Solution: I/O Store

– Insert a layer of indirection above the PLFS backend (sketched below)
– Model it after the POSIX API to minimize code changes

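A minimal sketch of that indirection, with invented names (the shipped PLFS classes are richer): backends hand out opaque handles in place of integer file descriptors, and index files can be read into memory instead of memory mapped.

    // Sketch: I/O Store interface, modeled on the POSIX calls PLFS already
    // makes so that existing code needs few changes.
    #include <sys/types.h>
    #include <cstddef>

    class IOSHandle {                     // stands in for the integer fd
    public:
        virtual ~IOSHandle() {}
        virtual ssize_t Pread(void *buf, size_t count, off_t offset) = 0;
        virtual ssize_t Write(const void *buf, size_t count) = 0;
        virtual int Close() = 0;
    };

    class IOStore {                       // one per backend: posix, hdfs, ...
    public:
        virtual ~IOStore() {}
        virtual IOSHandle *Open(const char *path, int flags, int &err) = 0;
        virtual int Unlink(const char *path) = 0;
        virtual int Mkdir(const char *path, mode_t mode) = 0;
    };

A posix store forwards each method to libc; an HDFS store tracks the file offset itself and replaces the old mmap of index files with a plain read.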

SLIDE 11: PLFS I/O Store Architecture

[Figure: PLFS FUSE (via the libc API) and PLFS MPI I/O both sit on libplfs; inside libplfs, the PLFS container layer calls the I/O store interface, behind which a posix I/O store talks to a kernel-mounted fs and an HDFS I/O store talks to Java code through lib{hdfs,jvm} and hdfs.jar.]

SLIDE 12: PLFS/HDFS Benchmark
  • Testbed: PRObE (www.nmc-probe.org)
  • Each node has dual 1.6GHz AMD cores, 16GB RAM, 1TB drive, gigabit ethernet

  • Ubuntu Linux, HDFS 0.21.0, PLFS, OpenMPI
  • Benchmark: LANL FS Test Suite (fs_test)
  • Simulates N-1 checkpoint, strided
  • Filesystems tested:

– PVFS OrangeFS 2.8.4 w/64MB stripe size
– PLFS/HDFS w/1 replica (local disk)
– PLFS/HDFS w/3 replicas (local disk + remote1 + remote2)

  • Blocksizes: 47001, 48K, 1M
  • Checkpoint size: 32GB written by 64 nodes


SLIDE 13: Benchmark Operation

[Figure: in the write phase, each node writes one block per stride into the shared file, with remaining strides continuing the pattern; in the read phase the node assignment is shifted so each node reads blocks written by another.]

We unmount and flush the caches of the data filesystem between the write and read phases.
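
In the same spirit as the sketch after slide 3, the access pattern and the read-phase shift reduce to a few lines (helper names invented; the exact shift amount is an assumption, the slide only says nodes are shifted):

    // Offset of a rank's block within stride s of the shared checkpoint file.
    #include <cstdint>

    int64_t n1_offset(int64_t s, int rank, int nprocs, int64_t blocksize) {
        return (s * nprocs + rank) * blocksize;   // ranks interleave per stride
    }

    // Read phase: shift ranks (here by one) so no node rereads what it wrote.
    int read_rank(int rank, int nprocs) { return (rank + 1) % nprocs; }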

SLIDE 14: PLFS Implementation Architecture

  • FUSE filesystem and a middleware lib (MPI)

[Figure: two paths into the PLFS lib. On the FUSE path, app processes issue I/O through the VFS/POSIX API; the kernel FUSE module upcalls to the user-level PLFS FUSE daemon, which holds the PLFS lib. On the MPI path, app processes link the PLFS/MPI libs directly. Both paths do backing-store I/O to a local fs (disk) or a distributed fs (network); MPI sync calls cross the interconnect to other nodes. A minimal FUSE hookup is sketched below.]
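
The FUSE path amounts to registering callbacks with the kernel's FUSE module; a minimal sketch with an invented handler that would forward to libplfs:

    // Sketch: skeleton of a PLFS-style FUSE daemon (handler name invented).
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <cstring>

    static int plfs_fuse_write(const char *path, const char *buf, size_t count,
                               off_t offset, struct fuse_file_info *fi) {
        // real code would hand (path, buf, count, offset) to libplfs, which
        // appends to container logs on the backing store
        (void)path; (void)buf; (void)offset; (void)fi;
        return (int)count;                    // report all bytes written
    }

    int main(int argc, char *argv[]) {
        struct fuse_operations ops;
        std::memset(&ops, 0, sizeof(ops));
        ops.write = plfs_fuse_write;          // called on each VFS write upcall
        return fuse_main(argc, argv, &ops, NULL);
    }

MPI applications skip the kernel round trip entirely: they link the PLFS/MPI libs into the process, as the figure's MPI path shows.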

SLIDE 15: PLFS/HDFS Write Bandwidth

[Figure: write bandwidth (Mbytes/s, 0-2000) vs access unit size (47001, 48K, 1M bytes) for PVFS-write, PLFS/HDFS1-write, and PLFS/HDFS3-write.]

SLIDE 16: PLFS/HDFS Write Bandwidth

[Figure: same write bandwidth chart as slide 15.]

  • PLFS/HDFS performs well (note HDFS1 is local disk)

SLIDE 17: PLFS/HDFS Write Bandwidth

[Figure: same write bandwidth chart as slides 15 and 16.]

  • PLFS/HDFS performs well (note HDFS3 is 3 copies)

SLIDE 18: PLFS/HDFS Read Bandwidth

[Figure: read bandwidth (Mbytes/s, 0-1000) vs access unit size (47001, 48K, 1M bytes) for PVFS-read, PLFS/HDFS1-read, and PLFS/HDFS3-read.]

  • HDFS with small access size benefits from PLFS log grouping

SLIDE 19: PLFS/HDFS Read Bandwidth

[Figure: same read bandwidth chart as slide 18.]

  • HDFS3 with large access size suffers imbalance

SLIDE 20: HDFS 1 vs 3: I/O Scheduling

[Figure: total size of data served per node (MB, 0-1000) across node numbers 10-60, for PLFS/HDFS1 vs PLFS/HDFS3.]

  • Network counters show HDFS3 read imbalance

SLIDE 21: I/O Store Status
  • Rewrote initial I/O Store prototype

– Production-level code
– Multiple concurrent instances of I/O Stores

  • Re-plumbed entire backend I/O path
  • Prototyped POSIX, HDFS, PVFS stores

– IOFSL done by EMC

  • Regression tested at LANL
  • I/O Store is now part of the released PLFS code

– https://github.com/PLFS


SLIDE 22: Conclusions
  • PLFS extensions for workload transformation:

– Logical FS interface

  • Not just container logs; packing small files, burst buffer

– I/O Store layer

  • Non-POSIX backends (HDFS, IOFSL, PVFS)
  • Compression, write buffering, IO forwarding

– Container index extensions

  • PLFS is open source, available on github

– http://github.com/plfs
– Developer email: plfs-devel@lists.sourceforge.net
