

SLIDE 1

ORNL is managed by UT-Battelle for the US Department of Energy

Diving into Petascale Production File Systems through Large Scale Profiling and Analysis

Feiyi Wang Co-authors: Hyogi Sim, Cameron Harr, Sarp Oral Oak Ridge National Laboratory Lawrence Livermore National Laboratory

SLIDE 2

Problem & Design Goals

  • Motivation

– Multiple large-scale production file systems
– I/O workload studies focus on the back-end storage
– jobstats or Darshan report I/O on a per-job basis

  • Goals

– “File characteristics and usage patterns” on a grand scale
– Scalable and fast
– Lightweight and infrastructure-less
– Portable

SLIDE 3

A Quick Start

$ brew install pkg-config libffi openmpi python
$ pip2 install virtualenv
$ virtualenv pcircle
$ source ~/pcircle/bin/activate
(pcircle) $ pip2 install git+https://github.com/olcf/pcircle@dev

To run it:

$ fprof ~/

To run it in parallel:

$ mpirun -np 8 fprof ~/

SLIDE 4

Design Overview

  • Parallelization Engine (PE)

– Distribute the workload across a cluster of machines to scan the file system

  • PE has two key components:

– work stealing – distributed termination

SLIDE 5

Work Stealing Pattern

1. Each worker maintains a local, independent work queue.
2. If its queue is empty, it sends a work request to a neighboring process.
3. Each worker responds to such requests from its peers by splitting its own queue and handing over part of the work.

All workers are peers; there is no master and there are no slaves. So how do we know, and who decides, when to quit?
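The loop above can be sketched with threads standing in for MPI ranks. This is a minimal illustration, not pcircle's API: `Worker`, `run`, and `simulate` are invented names, and a shared counter stands in for the distributed termination protocol described on the next slide.

```python
# Work-stealing sketch: threads in place of MPI ranks, a shared counter
# in place of distributed termination detection. Names are illustrative.
import random
import threading
from collections import deque

count_lock = threading.Lock()

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()             # local, independent work queue
        self.lock = threading.Lock()
        self.processed = 0

def run(worker, peers, remaining):
    while True:
        with worker.lock:
            item = worker.queue.popleft() if worker.queue else None
        if item is None:
            # Queue empty: ask a random peer to split its queue with us.
            victim = random.choice(peers)
            with victim.lock:
                half = len(victim.queue) // 2
                stolen = [victim.queue.pop() for _ in range(half)]
            if stolen:
                with worker.lock:
                    worker.queue.extend(stolen)
                continue
            # Nothing to steal: quit once the global count reaches zero.
            if remaining[0] == 0:
                return
        else:
            worker.processed += 1        # "process" the work item
            with count_lock:
                remaining[0] -= 1

def simulate(n_workers=4, n_items=1000):
    workers = [Worker(i) for i in range(n_workers)]
    workers[0].queue.extend(range(n_items))  # all work starts on one peer
    remaining = [n_items]
    threads = [threading.Thread(target=run, args=(w, workers, remaining))
               for w in workers]
    for t in threads: t.start()
    for t in threads: t.join()
    return [w.processed for w in workers]
```

Every item is processed exactly once even though all work starts on a single peer; the steals spread it out, which is the "self-balanced" property claimed later.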

SLIDE 6

Distributed Termination

  • Dijkstra-Scholten algorithm

– All nodes are arranged in a ring
– Each node maintains a color state, black or white, and the nodes pass around a token that is also colored black or white
– Termination condition: a white node 0 receives a white token

The beauty of the solution:

  • peer to peer, fully distributed
  • self-balanced
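The ring-token scheme above can be simulated in a few lines. This is a sketch under simplifying assumptions (synchronous rounds, a shared work array standing in for messages; not pcircle's implementation); the embedded assertion checks that detection never fires while work remains.

```python
# Round-based simulation of ring-token termination detection.
import random

def detect_termination(n=8, initial_work=50, seed=1):
    random.seed(seed)
    work = [0] * n
    work[0] = initial_work          # all work starts at node 0
    color = ['white'] * n
    token_at, token_color = 0, 'white'
    rounds = 0
    while True:
        rounds += 1
        for i in range(n):
            if work[i] > 0:
                work[i] -= 1        # consume one work unit
                if work[i] > 1 and random.random() < 0.3:
                    j = random.randrange(n)   # hand half our work away;
                    give = work[i] // 2       # sending turns us black
                    work[i] -= give
                    work[j] += give
                    color[i] = 'black'
        if work[token_at] == 0:     # the token only moves past idle nodes
            if color[token_at] == 'black':
                token_color = 'black'
            color[token_at] = 'white'
            token_at = (token_at + 1) % n
            if token_at == 0:       # token completed a full circuit
                if (token_color == 'white' and color[0] == 'white'
                        and work[0] == 0):
                    assert sum(work) == 0     # never a false positive
                    return rounds
                token_color = 'white'         # retry with a fresh token
```

A node that handed out work turns black and later blackens the token, so a stale "everyone looked idle" circuit is discarded and node 0 simply launches a fresh white token.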

SLIDE 7

Error Recovery

  • file system healthy != files healthy
  • “Not stat-able” files

– error code (catch it)
– no return at all (guard with a timer)
– the evil case: “uninterruptible sleep state” (requires a reboot)
– https://jira.hpdd.intel.com/browse/LU-8696

  • This is one case where I miss master/slave: a master gives a better chance of recovery

SLIDE 8

Other Practical Considerations

  • Work queue size

– The work queue is dynamic; its size depends on the file and directory layout
– Extreme case:

  • side-by-side directories, each with a large number of files
  • e.g., 100 directories, each with 10 million files

– Use a double-ended queue and prioritize file handling

  • Sparse files

– Three detection options: (1) compare st_blocks against st_size, (2) FIEMAP, (3) SEEK_HOLE with lseek()

  • LRU cache size

– A single client scanning a billion files
– Lustre lru_size (number of client-side locks kept in an LRU queue; default: unlimited)
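The first and third sparse-file checks can be sketched in Python (FIEMAP needs a raw ioctl and is omitted). The function names are illustrative, and SEEK_HOLE support depends on the OS and filesystem, so treat this as a sketch rather than portable code.

```python
# Sparse-file detection sketches. st_blocks is the portable heuristic;
# SEEK_HOLE requires Linux and a filesystem that reports holes.
import os

def is_sparse_by_stat(path):
    """Option (1): fewer bytes allocated than the apparent size."""
    st = os.stat(path)
    return st.st_blocks * 512 < st.st_size  # st_blocks is in 512 B units

def has_hole(path):
    """Option (3): SEEK_HOLE finds a hole before end-of-file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        return os.lseek(fd, 0, os.SEEK_HOLE) < size
    finally:
        os.close(fd)
```

A file created with `truncate()` and no data is reported sparse by the stat heuristic; a fully written file of the same size is not.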

SLIDE 9

File System Snapshot and Characterizations

                               OLCF (atlas1 & atlas2)   LC (lscratche)
File system                    Lustre                   Lustre
Back-end local file system     ldiskfs                  ZFS
Capacity (usable)              32 PB                    5.7 PB
File count                     0.92 billion             1.3 billion
Directory count                115 million              45 million
Hard link count                4,390,426                309,219
Symbolic link count            7,951,784                10,430,723
Sparse file count              3,240,848                N/A
Max # of files in a directory  6,006,529                26,646,573
Average file size              27.67 MB                 2.83 MB
Largest file                   32 TiB                   12.77 TiB

SLIDE 10

File size distribution

[Histograms of file size distribution by count and by space, binned from 4 KB to >4 TB]

(a) OLCF: small files (<1 MB) are 87.84% of the file count; large files (>1 GB) occupy 42.47% of the space.
(b) LC: small files (<1 MB) are 90.78% of the file count; large files (>1 GB) occupy 84.60% of the space.

SLIDE 11

Striping Pattern

Stripe count is a perennial cause of performance issues

  • Ad hoc but justified default settings: 4 (OLCF), 1 (LC)
  • 513,740 data points collected at OLCF (files >= 4 GB)
  • 21 distinct stripe counts in use
  • 96.83% of files keep the default setting
  • 2,262 files changed from 4 to 1; suspected file-per-process workloads
  • No correlation found between file size and stripe count
  • e.g., even the 32 TB file uses the default striping of 4.
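The "no correlation" claim can be checked with a plain Pearson coefficient over (file size, stripe count) pairs. A minimal sketch with an illustrative `pearson` helper; in practice the 513,740 collected (size, stripe count) points would feed it, not this synthetic data.

```python
# Pearson correlation coefficient between two equal-length samples.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

With almost every file at the default stripe count regardless of size, the coefficient comes out near zero, matching the observation above.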
SLIDE 12

Space Utilization Projection

  • Why do we want to project?

– OLCF: Spider 2 (Lustre) -> Spider 3 (GPFS)
– LC: Sierra is also GPFS-based
– Build proposals suggested large block sizes (16 MB, 32 MB) for better performance
– Trade-off: increased block size vs. potentially wasted space

  • How do we project?

– Mn Solver Dataset

  • Output from a Moment-Closure Solver; close to 100,000 files, average file size: 11 KB

– OLCF Spider 2
– LC (lscratche)
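The projection itself reduces to rounding every file size up to a whole number of candidate blocks. A minimal sketch (the `utilization` helper is an illustrative name, not the tool's code):

```python
# Space utilization for a candidate block size:
# actual data bytes / bytes allocated when every file is rounded up
# to a whole number of blocks.
def utilization(file_sizes, block_size):
    data = sum(file_sizes)
    allocated = sum(-(-s // block_size) * block_size  # ceil division
                    for s in file_sizes)
    return data / allocated if allocated else 1.0
```

Running this over a snapshot's file-size list for 256 KB through 32 MB blocks reproduces the comparison on the next slide; tiny files (like the 11 KB Mn Solver average) waste almost an entire large block each, which is why their utilization collapses as the block size grows.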

SLIDE 13

Comparison

  • Mn Solver Dataset:

– 86.37% efficiency with 256KB block size; 2.69% with 32MB block size

  • OLCF Spider 2:

– A 32 MB block on a 250 PB file system: 2% wasted, or about 5 PB

  • LC:

– 32 MB: 13.2% wasted; 16 MB is a much better choice

Space Utilization Rate (actual data size / total size with allocated blocks) by simulated file system block size:

Block size   Mn Solver Dataset   LC (lscratche)   OLCF (spider 2)
256 KB       86.37%              99.94%           99.99%
512 KB       78.13%              99.86%           99.97%
1 MB         54.60%              99.67%           99.94%
4 MB         19.24%              98.50%           99.75%
8 MB         10.33%              96.88%           99.51%
16 MB         5.32%              98.84%           98.92%
32 MB         2.69%              87.80%           97.91%

SLIDE 14

Summary

  • We present and demonstrate lightweight, portable and scalable profiling tools
  • Three use cases:

– File system snapshot and characterization – Stripe pattern analysis – Simulated block analysis and projection

  • Available at: http://www.github.com/olcf/pcircle

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. The work at LC was performed under the auspices of the DOE by LLNL under Contract DE-AC52-07NA27344.