An Introduction to the Lustre Parallel File System
Tom Edwards, tedwards@cray.com
Agenda
- Introduction to storage hardware
- RAID
- Parallel Filesystems
- Lustre
- Mapping Common I/O Strategies to Lustre
- Spokesperson
- Multiple writers – multiple files
- Multiple writers – single file
- Collective IO
- Tuning Lustre Settings
- Case studies
- Conclusions
Building blocks of HPC file systems
- Modern supercomputer hardware is typically built on two fundamental pillars:
  1. The use of widely available commodity (inexpensive) hardware.
  2. Using parallelism to achieve very high performance.
- The file systems connected to these computers are built in the same way:
  - gather large numbers of widely available, inexpensive storage devices;
  - then connect them together in parallel to create a high-bandwidth, high-capacity storage device.
Commodity storage
- There are typically two commodity storage technologies found in HPC file systems
  - HDDs are much more common, but SSDs look promising
  - Both are commonly referred to as "block devices"

Hard Disk Drives (HDD)
- Description: data stored magnetically on spinning disk platters, read and written by a moving "head"
- Advantages: large capacity (TBs); inexpensive
- Disadvantages: higher seek latency; lower bandwidth (<100 MB/s); higher power draw

Solid State Devices (SSD)
- Description: data stored in integrated circuits, typically NAND flash memory
- Advantages: very low seek latency; high bandwidth (~500 MB/s); lower power draw
- Disadvantages: expensive; smaller capacity (GBs); limited life span
Redundant Arrays of Inexpensive Disks (RAID)
- RAID is a technology for combining multiple smaller block devices into a single larger/faster block device
- Specialist RAID controllers automatically distribute data in fixed-size "blocks" or "stripes" over the individual disks
- Striping blocks over multiple disks allows data to be read and written in parallel, resulting in higher aggregate bandwidth (RAID0) – see the sketch below
[Figure: a large file (/file/data) on the server is written to the RAID device; the RAID controller distributes the blocks over the disks, giving higher aggregate bandwidth]
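To make the block-to-disk mapping concrete, here is a minimal sketch in C (hypothetical block size and disk count, not the logic of any particular controller) of how a RAID0 layout turns a file offset into a disk index and an offset on that disk:

    #include <stdio.h>

    /* Hypothetical RAID0 layout: find which disk a byte of the file lands on
       and where on that disk it sits. Block size and disk count are
       illustrative values only. */
    static void raid0_map(long long offset, long long block_size, int num_disks,
                          int *disk, long long *disk_offset)
    {
        long long block = offset / block_size;            /* global block index   */
        *disk = (int)(block % num_disks);                 /* round-robin on disks */
        *disk_offset = (block / num_disks) * block_size   /* whole rows below     */
                     + offset % block_size;               /* plus offset in block */
    }

    int main(void)
    {
        int disk;
        long long off;
        raid0_map(5LL * 1024 * 1024, 1024 * 1024, 4, &disk, &off);
        printf("disk %d, offset %lld\n", disk, off);      /* disk 1, offset 1 MiB */
        return 0;
    }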
Redundant Arrays of Inexpensive Disks (RAID)
- Only using striping exposes data to increased risk, as it is likely that all data will be lost if any one drive fails
- To protect against this, the controller can store additional "parity" blocks which allow the array to survive one or two disks failing (RAID5 / RAID6) – see the sketch below
- Additional drives are required, but the data's integrity is ensured
[Figure: the large file is again written to the RAID device; the controller distributes the blocks and writes additional parity blocks to "spare" disks]
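As a rough illustration of why parity lets the array survive a drive failure, the sketch below uses single bytes to stand in for whole blocks (values are arbitrary) and shows how an XOR parity block, as used by RAID5-style schemes, rebuilds a lost block from the survivors. RAID6 adds a second, differently computed parity block to survive two failures; that part is not shown here.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Three data "blocks" (one byte each for illustration) and their parity. */
        uint8_t d0 = 0xA5, d1 = 0x3C, d2 = 0x7E;
        uint8_t parity = d0 ^ d1 ^ d2;              /* stored on the parity disk */

        /* If the drive holding d1 fails, XORing the surviving blocks with the
           parity block reproduces it exactly. */
        uint8_t rebuilt = d0 ^ d2 ^ parity;
        printf("d1 = 0x%02X, rebuilt = 0x%02X\n", d1, rebuilt);
        return 0;
    }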
Degraded arrays
- A RAID6 array can survive any two drives failing
- Once the faulty drives are replaced, the array has to be rebuilt from the data on the existing drives
- Rebuilds can happen while the array is running, but may take many hours to complete and will reduce the performance of the array
[Figure: the same RAID device with two failed drives marked; the data and parity blocks on the surviving disks keep the array operating]
Combining RAID devices into a parallel file system
- There are economic and practical limits on the size of individual RAID6 arrays
  - Most common arrays contain around 10 drives
  - This limits capacity to terabytes and bandwidth to a few GB/s
  - It may also be difficult to share the file system with many client nodes
- To achieve the required performance, supercomputers combine multiple RAID devices to form a single parallel file system
- ARCHER and many other supercomputers use the Lustre parallel file system
  - Lustre joins multiple block devices (RAID arrays) into a single file system that applications can read/write from/to in parallel
  - Scales to hundreds of block devices and 100,000s of client nodes
Lustre Building Blocks
- Object Storage Targets (OST) – the block devices that data will be distributed over. These are commonly RAID6 arrays of HDDs.
- Object Storage Server (OSS) – a dedicated server that is directly connected to one or more OSTs. These are usually connected to the supercomputer via a high-performance network.
- MetaData Server (MDS) – a single server per file system that is responsible for holding metadata on individual files:
  - Filename and location
  - Permissions and access control
  - Which OSTs data is held on
- Lustre Clients – remote clients that can mount the Lustre file system, e.g. Cray XC30 compute nodes.
[Figure: many Lustre clients connect over the high-performance computing interconnect to multiple OSSs (each with its OSTs) and to a single MDS per file system, which holds each file's name, permissions, attributes and location]
ARCHER’s Lustre – Cray Sonexion Storage
SSU: Scalable Storage Unit
- Contains the storage controller, Lustre server, disk controller and RAID engine
- Each unit is 2 OSSs, each with 4 OSTs of 10 (8+2) disks in a RAID6 array – i.e. 2 OSSs and 8 OSTs per SSU
- Multiple SSUs are combined to form storage racks

MMU: Metadata Management Unit
- Lustre metadata server
- Contains server hardware and storage
ARCHER’s File systems
/fs2: 6 SSUs, 12 OSSs, 48 OSTs, 480 HDDs (4 TB per HDD), 1.4 PB total
/fs3: 6 SSUs, 12 OSSs, 48 OSTs, 480 HDDs (4 TB per HDD), 1.4 PB total
/fs4: 7 SSUs, 14 OSSs, 56 OSTs, 560 HDDs (4 TB per HDD), 1.6 PB total
All are connected to the Cray XC30 via LNET router service nodes over an Infiniband network.
Lustre data striping
Lustre's performance comes from striping files over multiple OSTs
[Figure: a single logical user file, e.g. /work/y02/y02/ted, is automatically divided by the OS/file system into stripes, which are then read/written to/from their assigned OSTs]
RAID blocks vs Lustre Stripes
- RAID blocks and Lustre stripes appear, at least on the surface, to perform a similar function; however, there are some important differences.

Redundancy
- RAID: OSTs are typically configured with RAID6 to ensure data integrity if an individual drive fails
- Lustre: provides no redundancy; if an individual OST becomes unavailable, all files using that OST are inaccessible

Flexibility
- RAID: the block/stripe size and distribution are chosen when the array is created and cannot be changed by the user
- Lustre: the number and size of the stripes used can be controlled by the user on a file-by-file basis when the file is created (see later)

Size
- Lustre: stripe sizes are usually between 1 and 32 MB
Opening a file
- The client sends a request to the MDS to open/acquire information about the file
- The MDS then passes back a list of OSTs
  - For an existing file, these contain the data stripes
  - For a new file, these typically contain a randomly assigned list of OSTs where the data is to be stored
- Once a file has been opened, no further communication is required between the client and the MDS
- All transfer is directly between the assigned OSTs and the client
[Figure: the client sends the Open request to the MDS (name, permissions, attributes, location); subsequent reads/writes go directly between the client and the OSTs on the OSSs]
File decomposition – 2 Megabyte stripes
[Figure: a file is divided into eight 2 MB stripes, distributed round-robin over OSTs 3, 5, 7 and 11; e.g. stripes labelled 3-0 and 3-1 are the first and second stripes written to OST 3]
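A minimal sketch of the arithmetic behind this decomposition (the OST indices and 2 MB stripe size mirror the figure; in reality the OST list comes from the MDS):

    #include <stdio.h>

    int main(void)
    {
        /* Layout from the figure: 2 MiB stripes, round-robin over OSTs 3, 5, 7, 11. */
        const long long stripe_size = 2LL * 1024 * 1024;
        const int osts[] = {3, 5, 7, 11};
        const int stripe_count = 4;

        long long offset = 9LL * 1024 * 1024;      /* a byte 9 MiB into the file */
        long long stripe = offset / stripe_size;   /* stripe index 4             */
        int ost = osts[stripe % stripe_count];     /* lands back on OST 3 (3-1)  */
        long long within = offset % stripe_size;   /* 1 MiB into that stripe     */

        printf("stripe %lld -> OST %d (offset %lld within the stripe)\n",
               stripe, ost, within);
        return 0;
    }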
Example: open(unit=12,file="out.dat")
[Figure: on open, the client contacts the MDS over the interconnect to obtain the file's name, permissions, attributes and location (its list of OSTs)]
Example: write(12,*) data
[Figure: on write, the data travels over the interconnect directly between the client and the assigned OSSs/OSTs; the MDS is not involved. Multiple OSSs and OSTs, one MDS per file system]
Key points
- Lustre achieves high performance through parallelism
  - Best performance comes from multiple clients writing to multiple OSTs
- Lustre is designed to achieve high bandwidth to/from a small number of files
  - Typical use case is a scratch file system for HPC
  - It is a good match for scientific datasets and/or checkpoint data
- Lustre is not designed to handle large numbers of small files
  - Potential bottlenecks at the MDS when files are opened
  - Data will not be spread over multiple OSTs
  - Not a good choice for compilation
- Lustre is NOT a bullet-proof file system
  - If an OST fails, all files using that OST are inaccessible
  - Individual OSTs may use RAID6, but this is a last resort
  - BACK UP important data elsewhere!
Mapping Common I/O Patterns to Lustre
I/O strategies: Spokesperson
- One process performs I/O
  - Data aggregation or duplication
  - Limited by the single I/O process
- Easy to program
- Pattern does not scale
  - Time increases linearly with the amount of data
  - Time increases with the number of processes
  - Care has to be taken when doing the all-to-one kind of communication at scale
- Can be used for a dedicated I/O server
[Figure: many Lustre clients funnel their data through a single client, which becomes the bottleneck between them and the file system]
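A minimal MPI sketch of the spokesperson pattern (file name, data size and layout are illustrative, not taken from the slides): every rank sends its block to rank 0, which performs the only write.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        const int n = 1024;                               /* doubles per rank */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *local = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) local[i] = rank;

        double *global = NULL;
        if (rank == 0) global = malloc((size_t)n * size * sizeof(double));

        /* All-to-one aggregation: this step grows with both the data volume
           and the process count, which is why the pattern does not scale. */
        MPI_Gather(local, n, MPI_DOUBLE, global, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {                          /* single writer, single file */
            FILE *f = fopen("out.dat", "wb");
            fwrite(global, sizeof(double), (size_t)n * size, f);
            fclose(f);
            free(global);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }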
I/O strategies: Multiple Writers – Multiple Files
- All processes perform I/O to individual files
  - Limited by the file system
- Easy to program
- Pattern may not scale at large process counts
  - The number of files creates a bottleneck with metadata operations
  - The number of simultaneous disk accesses creates contention for file system resources
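A minimal MPI sketch of the file-per-process pattern (the file-name pattern and data size are illustrative): each rank opens and writes its own file, so every open is a separate metadata operation on the MDS.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double data[1024] = {0};
        char name[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One file per rank: simple, but the number of files (and of
           simultaneous opens) grows with the job size. */
        snprintf(name, sizeof(name), "out.%06d.dat", rank);
        FILE *f = fopen(name, "wb");
        fwrite(data, sizeof(double), 1024, f);
        fclose(f);

        MPI_Finalize();
        return 0;
    }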
I/O strategies: Multiple Writers – Single File
- Each process performs I/O to a single, shared file
- Performance
  - Data layout within the shared file is very important
  - At large process counts, contention can build for file system resources
- Not all programming languages support it
  - C/C++ can work with fseek
  - No real Fortran standard
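A minimal sketch of the shared-file pattern using plain C file I/O with fseek, as mentioned above (file name and slice size are illustrative; it assumes the shared file already exists, e.g. created beforehand by one rank): every rank seeks to a disjoint offset and writes its own slice.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        const long n = 1024;                      /* doubles per rank         */
        double data[1024];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (long i = 0; i < n; i++) data[i] = rank;

        /* Each rank writes a disjoint slice of the same file at its own offset.
           How these slices line up with the Lustre stripes decides performance. */
        FILE *f = fopen("shared.dat", "r+b");     /* assumed to exist already */
        fseek(f, (long)rank * n * (long)sizeof(double), SEEK_SET);
        fwrite(data, sizeof(double), n, f);
        fclose(f);

        MPI_Finalize();
        return 0;
    }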
I/O strategies: Collective IO to single or multiple files
- Aggregation to a processor in a group, which processes the data
  - Serialises I/O within the group
  - The I/O process may access independent files
  - Limits the number of files accessed
- Alternatively, a group of processes performs parallel I/O to a shared file
  - Increases the number of writers (compared with a single spokesperson) to increase file system usage
  - Decreases the number of processes which access a shared file (compared with every process writing) to decrease file system contention
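A minimal MPI-IO sketch of collective I/O to a single shared file (file name and data size are illustrative): all ranks call the collective write, and the MPI-IO layer is free to aggregate the requests onto a subset of processes before touching the file system.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int n = 1024;                       /* doubles per rank */
        double data[1024];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < n; i++) data[i] = rank;

        MPI_File_open(MPI_COMM_WORLD, "collective.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Collective call: every rank participates, each writing its own
           disjoint slice; the library decides how to aggregate the traffic. */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at_all(fh, offset, data, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }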
Special case: standard output and error
- All STDIN, STDOUT, and STDERR I/O streams serialise through aprun
- Disable debugging messages when running in production mode
  - e.g. "Hello, I'm task 32,000!"
  - e.g. "Task 64,000, made it through loop."
Tuning Lustre Settings
Matching Lustre striping to an application
Controlling Lustre striping
- lfs is the Lustre utility for setting the stripe properties of new files, or displaying the striping patterns of existing ones
- The most used options are
  - setstripe – set striping properties of a directory or new file
  - getstripe – return information on current striping settings
  - osts – list the OSTs associated with this file system
  - df – show disk usage of this file system
- For help, execute lfs without any arguments:

    $ lfs
    lfs > help
    Available commands are:
            setstripe
            find
            getstripe
            check
            ...
Sample Lustre commands: lfs df
    crayadm@thor-1:~> lfs df -h
    UUID                     bytes     Used    Available  Use%  Mounted on
    snx11183-MDT0000_UUID     2.8T    25.0G       2.7T      1%  /lustre[MDT:0]
    snx11183-OST0000_UUID   169.4T   120.5T      47.0T     72%  /lustre[OST:0]
    snx11183-OST0001_UUID   169.4T   120.7T      46.9T     72%  /lustre[OST:1]
    snx11183-OST0002_UUID   169.4T    79.5T      88.0T     47%  /lustre[OST:2]
    snx11183-OST0003_UUID   169.4T    79.8T      87.7T     48%  /lustre[OST:3]
    snx11183-OST0004_UUID   169.4T   116.2T      51.4T     69%  /lustre[OST:4]
    snx11183-OST0005_UUID   169.4T   116.2T      51.4T     69%  /lustre[OST:5]
    snx11183-OST0006_UUID   169.4T   116.1T      51.4T     69%  /lustre[OST:6]
    snx11183-OST0007_UUID   169.4T    98.5T      69.1T     59%  /lustre[OST:7]

    filesystem summary:       1.3P   847.4T     492.9T     63%  /lustre
lfs setstripe
- Sets the striping for a file or a directory:

    lfs setstripe <file|dir> <-s size> <-i start> <-c count>

  - size: number of bytes on each OST (0 = filesystem default)
  - start: OST index of the first stripe (-1 = filesystem default)
  - count: number of OSTs to stripe over (0 = default, -1 = all)
- Comments
  - Can use lfs to create an empty file with the stripes you want (like the touch command)
  - Can apply striping settings to a directory; any children will inherit the parent's stripe settings on creation
  - The striping of a file is fixed when the file is created; it is not possible to change it afterwards
  - The start index is the only OST you can choose explicitly; from the second OST onwards you have no control over which ones are used
Select best Lustre striping values
- Selecting the striping values will have a large impact on the I/O performance of your application
- Rules of thumb (illustrated by the example commands below):
  1. #files > #OSTs: set stripe_count=1 – you will reduce Lustre contention and OST file locking this way and gain performance
  2. #files == 1: set stripe_count=#OSTs, assuming you have more than 1 I/O client
  3. #files < #OSTs: select stripe_count so that you use all OSTs. Example: you have 8 OSTs and write 4 files at the same time, then select stripe_count=2
- Always allow the system to choose OSTs at random!
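As an illustration only (hypothetical directory names, and assuming a file system with 8 OSTs), the three rules of thumb might translate into commands such as:

    # Rule 1: many files (more files than OSTs) -> one stripe per file
    lfs setstripe -c 1 many_files_dir

    # Rule 2: one shared file written by many clients -> stripe over all OSTs
    lfs setstripe -c -1 shared_file_dir

    # Rule 3: 4 files on 8 OSTs -> 2 stripes each, so every OST is used
    lfs setstripe -c 2 four_files_dir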
Sample Lustre commands: striping
    crystal:ior% mkdir tigger
    crystal:ior% lfs setstripe -s 2m -c 4 tigger
    crystal:ior% lfs getstripe tigger
    tigger
    stripe_count:  4  stripe_size:  2097152  stripe_offset:  -1
    crystal% cd tigger
    crystal:tigger% ~/tools/mkfile_linux/mkfile BIGFILE 2g
    crystal:tigger% ls -lh BIGFILE
    -rw------T 1 harveyr criemp 2.0G Sep 11 07:50 BIGFILE
    crystal:tigger% lfs getstripe BIGFILE
    lmm_stripe_count:   4
    lmm_stripe_size:    2097152
    lmm_layout_gen:     0
    lmm_stripe_offset:  26
          obdidx       objid         objid       group
              26    33770409     0x2034ba9           0
              10    33709179     0x2025c7b           0
              18    33764129     0x2033321           0
              22    33762112     0x2032b40           0
Case Study 1: Spokesperson
- 32 MB per OST (32 MB – 5 GB) and 32 MB transfer size
- Unable to take advantage of file system parallelism
- Access to multiple disks adds overhead which hurts performance
[Figure: single-writer write performance (MB/s) against stripe count (1 to 160), for 1 MB and 32 MB stripes]
Case Study 2: Parallel I/O into a single file
- A particular code both reads and writes a 377 GB file. Runs on 6000 cores.
  - Total I/O volume (reads and writes) is 850 GB
  - Utilises parallel HDF5
- Default stripe settings: count=4, size=1M, index=-1
  - 1800 s run time (~30 minutes)
- Stripe settings: count=-1, size=1M, index=-1
  - 625 s run time (~10 minutes)
- Result: ~65% decrease in run time
Case Study 3: Single File Per Process
- 128 MB per file and a 32 MB transfer size; each file has a stripe_count of 1
[Figure: file-per-process write performance (MB/s) against the number of processes/files (2000 to 10000), for 1 MB and 32 MB stripes]
Conclusions
- Lustre is a high-performance, high-bandwidth parallel file system
  - It requires multiple writers to multiple stripes to achieve best performance
- There is a large amount of I/O bandwidth available to applications that make use of it; however, users need to match the size and number of Lustre stripes to the way files are accessed
  - Large stripes and counts for big files
  - Small stripes and counts for smaller files
- Lustre on ARCHER is for storing scratch data only
  - IT IS NOT BACKED UP!