

SLIDE 1

ORNL is managed by UT-Battelle for the US Department of Energy

Diving into Petascale Production File Systems through Large Scale Profiling and Analysis

Feiyi Wang Co-authors: Hyogi Sim, Cameron Harr, Sarp Oral Oak Ridge National Laboratory Lawrence Livermore National Laboratory

SLIDE 2

Problem & Design Goals

  • Motivation

– Multiple large-scale production file systems
– I/O workload studies focus on the back-end storage
– jobstats or Darshan report I/O on a per-job basis

  • Goals

– “File characteristics and usage patterns” on a grand scale
– Scalable and fast
– Lightweight and infrastructure-less
– Portable

SLIDE 3

A Quick Start

$ brew install pkg-config libffi openmpi python
$ pip2 install virtualenv
$ virtualenv pcircle
$ source ~/pcircle/bin/activate
(pcircle) $ pip2 install git+https://github.com/olcf/pcircle@dev

To run it:

$ fprof ~/

To run it in parallel:

$ mpirun -np 8 fprof ~/

SLIDE 4

Design Overview

  • Parallelization Engine (PE)

– Distribute the workload across a cluster of machines to scan the file system

  • PE has two key components:

– work stealing – distributed termination

SLIDE 5

Work Stealing Pattern

1. Each worker maintains a local, independent work queue.
2. If its queue is empty, it sends a work request to a neighboring process.
3. Each worker responds to such requests from its peers by splitting its own queue and handing over part of the work.

All workers are peers; there is no master and there are no slaves. So how do we know, and who decides, when to quit?
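The loop above can be sketched with threads standing in for MPI ranks. This is a minimal illustration, not pcircle's API: `Worker`, `run`, and `simulate` are invented names, and a shared counter stands in for the distributed termination protocol described on the next slide.

```python
# Work-stealing sketch: threads in place of MPI ranks, a shared counter
# in place of distributed termination detection. Names are illustrative.
import random
import threading
from collections import deque

count_lock = threading.Lock()

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()             # local, independent work queue
        self.lock = threading.Lock()
        self.processed = 0

def run(worker, peers, remaining):
    while True:
        with worker.lock:
            item = worker.queue.popleft() if worker.queue else None
        if item is None:
            # Queue empty: ask a random peer to split its queue with us.
            victim = random.choice(peers)
            with victim.lock:
                half = len(victim.queue) // 2
                stolen = [victim.queue.pop() for _ in range(half)]
            if stolen:
                with worker.lock:
                    worker.queue.extend(stolen)
                continue
            # Nothing to steal: quit once the global count reaches zero.
            if remaining[0] == 0:
                return
        else:
            worker.processed += 1        # "process" the work item
            with count_lock:
                remaining[0] -= 1

def simulate(n_workers=4, n_items=1000):
    workers = [Worker(i) for i in range(n_workers)]
    workers[0].queue.extend(range(n_items))  # all work starts on one peer
    remaining = [n_items]
    threads = [threading.Thread(target=run, args=(w, workers, remaining))
               for w in workers]
    for t in threads: t.start()
    for t in threads: t.join()
    return [w.processed for w in workers]
```

Every item is processed exactly once even though all work starts on a single peer; the steals spread it out, which is the "self-balanced" property claimed later.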

SLIDE 6

Distributed Termination

  • Dijkstra-Scholten algorithm

– All nodes are arranged in a ring
– Each node maintains a color state, black or white, and the nodes pass around a token that is also colored black or white
– Termination condition: a white node 0 receives a white token

The beauty of the solution:

  • peer to peer, fully distributed
  • self-balanced
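The ring-token scheme above can be simulated in a few lines. This is a sketch under simplifying assumptions (synchronous rounds, a shared work array standing in for messages; not pcircle's implementation); the embedded assertion checks that detection never fires while work remains.

```python
# Round-based simulation of ring-token termination detection.
import random

def detect_termination(n=8, initial_work=50, seed=1):
    random.seed(seed)
    work = [0] * n
    work[0] = initial_work          # all work starts at node 0
    color = ['white'] * n
    token_at, token_color = 0, 'white'
    rounds = 0
    while True:
        rounds += 1
        for i in range(n):
            if work[i] > 0:
                work[i] -= 1        # consume one work unit
                if work[i] > 1 and random.random() < 0.3:
                    j = random.randrange(n)   # hand half our work away;
                    give = work[i] // 2       # sending turns us black
                    work[i] -= give
                    work[j] += give
                    color[i] = 'black'
        if work[token_at] == 0:     # the token only moves past idle nodes
            if color[token_at] == 'black':
                token_color = 'black'
            color[token_at] = 'white'
            token_at = (token_at + 1) % n
            if token_at == 0:       # token completed a full circuit
                if (token_color == 'white' and color[0] == 'white'
                        and work[0] == 0):
                    assert sum(work) == 0     # never a false positive
                    return rounds
                token_color = 'white'         # retry with a fresh token
```

A node that handed out work turns black and later blackens the token, so a stale "everyone looked idle" circuit is discarded and node 0 simply launches a fresh white token.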

SLIDE 7

Error Recovery

  • file system healthy != files healthy
  • “Not stat-able” files

– error code (catch it)
– no return at all (guard with a timer)
– the evil case: “uninterruptible sleep state” (requires a reboot)
– https://jira.hpdd.intel.com/browse/LU-8696

  • This is one case where I miss master/slave: a master gives a better chance of recovery

SLIDE 8

Other Practical Considerations

  • Work queue size

– The work queue is dynamic; its size depends on the file and directory layout
– Extreme case:

  • side-by-side directories, each with a large number of files
  • e.g., 100 directories, each with 10 million files

– Use a double-ended queue and prioritize file handling

  • Sparse files

– Three detection options: (1) compare st_blocks against st_size, (2) FIEMAP, (3) SEEK_HOLE with lseek()

  • LRU cache size

– A single client scanning a billion files
– Lustre lru_size (number of client-side locks kept in an LRU queue; default: unlimited)
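The first and third sparse-file checks can be sketched in Python (FIEMAP needs a raw ioctl and is omitted). The function names are illustrative, and SEEK_HOLE support depends on the OS and filesystem, so treat this as a sketch rather than portable code.

```python
# Sparse-file detection sketches. st_blocks is the portable heuristic;
# SEEK_HOLE requires Linux and a filesystem that reports holes.
import os

def is_sparse_by_stat(path):
    """Option (1): fewer bytes allocated than the apparent size."""
    st = os.stat(path)
    return st.st_blocks * 512 < st.st_size  # st_blocks is in 512 B units

def has_hole(path):
    """Option (3): SEEK_HOLE finds a hole before end-of-file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        return os.lseek(fd, 0, os.SEEK_HOLE) < size
    finally:
        os.close(fd)
```

A file created with `truncate()` and no data is reported sparse by the stat heuristic; a fully written file of the same size is not.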

SLIDE 9

File System Snapshot and Characterizations

                               OLCF (atlas1 & atlas2)   LC (lscratche)
File system                    Lustre                   Lustre
Back-end local file system     ldiskfs                  ZFS
Capacity (usable)              32 PB                    5.7 PB
File count                     0.92 billion             1.3 billion
Directory count                115 million              45 million
Hard link count                4,390,426                309,219
Symbolic link count            7,951,784                10,430,723
Sparse file count              3,240,848                N/A
Max # of files in a directory  6,006,529                26,646,573
Average file size              27.67 MB                 2.83 MB
Largest file                   32 TiB                   12.77 TiB

SLIDE 10

File size distribution

[Histograms of file size distribution by count and by space, binned from 4 KB to >4 TB]

(a) OLCF: small files (<1 MB) are 87.84% of the file count; large files (>1 GB) occupy 42.47% of the space.
(b) LC: small files (<1 MB) are 90.78% of the file count; large files (>1 GB) occupy 84.60% of the space.

SLIDE 11

Striping Pattern

Stripe count is a perennial cause of performance issues

  • Ad hoc but justified default settings: 4 (OLCF), 1 (LC)
  • 513,740 data points collected at OLCF (files >= 4 GB)
  • 21 distinct stripe counts in use
  • 96.83% of files keep the default setting
  • 2,262 files changed from 4 to 1; suspected file-per-process workloads
  • No correlation found between file size and stripe count
  • e.g., even the 32 TB file uses the default striping of 4.
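The "no correlation" claim can be checked with a plain Pearson coefficient over (file size, stripe count) pairs. A minimal sketch with an illustrative `pearson` helper; in practice the 513,740 collected (size, stripe count) points would feed it, not this synthetic data.

```python
# Pearson correlation coefficient between two equal-length samples.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

With almost every file at the default stripe count regardless of size, the coefficient comes out near zero, matching the observation above.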
SLIDE 12

Space Utilization Projection

  • Why do we want to project?

– OLCF: Spider 2 (Lustre) -> Spider 3 (GPFS)
– LC: Sierra is also GPFS-based
– Build proposals suggested large block sizes (16 MB, 32 MB) for better performance
– Trade-off: increased block size vs. potentially wasted space

  • How do we project?

– Mn Solver Dataset

  • Output from a Moment-Closure Solver; close to 100,000 files, average file size: 11 KB

– OLCF Spider 2
– LC (lscratche)
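The projection itself reduces to rounding every file size up to a whole number of candidate blocks. A minimal sketch (the `utilization` helper is an illustrative name, not the tool's code):

```python
# Space utilization for a candidate block size:
# actual data bytes / bytes allocated when every file is rounded up
# to a whole number of blocks.
def utilization(file_sizes, block_size):
    data = sum(file_sizes)
    allocated = sum(-(-s // block_size) * block_size  # ceil division
                    for s in file_sizes)
    return data / allocated if allocated else 1.0
```

Running this over a snapshot's file-size list for 256 KB through 32 MB blocks reproduces the comparison on the next slide; tiny files (like the 11 KB Mn Solver average) waste almost an entire large block each, which is why their utilization collapses as the block size grows.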

SLIDE 13

Comparison

  • Mn Solver Dataset:

– 86.37% efficiency with 256KB block size; 2.69% with 32MB block size

  • OLCF Spider 2:

– A 32 MB block on a 250 PB file system: 2% wasted, or about 5 PB

  • LC:

– 32 MB: 13.2% wasted; 16 MB is a much better choice

Space Utilization Rate (actual data size / total size with allocated blocks) by simulated file system block size:

Block size   Mn Solver Dataset   LC (lscratche)   OLCF (spider 2)
256 KB       86.37%              99.94%           99.99%
512 KB       78.13%              99.86%           99.97%
1 MB         54.60%              99.67%           99.94%
4 MB         19.24%              98.50%           99.75%
8 MB         10.33%              96.88%           99.51%
16 MB         5.32%              98.84%           98.92%
32 MB         2.69%              87.80%           97.91%

SLIDE 14

Summary

  • We present and demonstrate lightweight, portable and scalable profiling tools
  • Three use cases:

– File system snapshot and characterization – Stripe pattern analysis – Simulated block analysis and projection

  • Available at: http://www.github.com/olcf/pcircle

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. The work at LC was performed under the auspices of the DOE by LLNL under Contract DE-AC52-07NA27344.