STORAGE@TGCC & LUSTRE FILESYSTEMS
WORKING & BEST PRACTICES
PATC PARALLEL I/O APRIL 25TH, 2018
Philippe DENIEL | CEA/DAM/DIF

AGENDA
TGCC storage architecture
TGCC storage workspaces
Two-level storage: Level 1 is 1 PB of disks, Level 2 is 30 PB of tapes; 7.5 PB file systems (work, store), plus scratch.
Local file system: 1 disk, 1 client. Example: login home.
File server: 1 server, N clients. Benefits: sharing, access from any workstation.
Parallel file system: M servers, N clients. Example: scratch of a supercomputer. Benefits: scalability, performance, fault tolerance.
Lustre architecture: the compute code (clients) talks to one metadata server (MDS) and several data servers (OSS).
MDS: stores the metadata (attributes): directories, file names, access rights, dates… Operations: create, mkdir, rename, ls, chmod… Low extensibility; cost unit: the inode.
OSS: store the file data. Extensible; cost unit: the volume.
Hardware layout: 1 metadata server (+ failover) and several data servers (OSS), backed by disk arrays:
RAID with 8 data disks + 2 parity disks
RAID10 with 2 disks + 2 mirrors
To increase data throughput, Lustre can parallelize file storage over several servers: data is distributed across servers in blocks of "stripe size".
Striping layouts: stripe count = 1, stripe count = 2, …, stripe count = N.
Example: with stripe_count=4 and stripe_size=1MB, the file is split into 1 MB blocks distributed round-robin over 4 data servers.
A stripe count > 1 induces extra costs (N servers to communicate with) but increases bandwidth:
Useless for small files (< a few MB)
Worthwhile for bigger files (~ gigabyte-sized)
If the file is accessed from a single client, stripe_count = 2 is enough to reach the maximum throughput
Increase the stripe count if many clients write large volumes of data to the same file
Mandatory for huge files (hundreds of GB): avoid putting more than 500 GB on a single server
Striping is configured per directory: files and sub-directories inherit the setting when they are created. Changing it only affects new files (not previously created ones). Command:
lfs setstripe -c <stripe_count> <directory>
Default stripe count: 1 on scratch and work, 4 on store, 4 with MPI-IO.
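As an illustration, a minimal shell sketch (the directory name, the $CCCSCRATCHDIR variable and the chosen values are assumptions, not site recommendations): create a directory meant for large files written by many clients, raise its stripe count, and check the layout with lfs getstripe:

# hypothetical directory for large, multi-client output files
mkdir -p $CCCSCRATCHDIR/big_output

# new files created in it will be striped over 8 servers, in 4 MB stripes
# (-S sets the stripe size; older Lustre versions use -s instead)
lfs setstripe -c 8 -S 4M $CCCSCRATCHDIR/big_output

# verify the layout that new files will inherit
lfs getstripe $CCCSCRATCHDIR/big_output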
Avoid using "ls -l" when "ls" is enough
Avoid having a huge number of files in a single directory (< 1000)
Avoid small files on Lustre filesystems
Use a stripe count of 1 for directories with many small files
Lustre filesystems are not backed up: keep critical data (e.g. source code) in your home
Limit the number of processes writing to the same file (locking contentions)
Avoid starting executables from Lustre (they run slower)
Avoid repetitive open/close operations. Example of a wrong script (a corrected version is sketched after this list):
while …
do
    echo 'bla' >> my_file.out
done
Open "read-only" when only reading a file, to reduce locking contentions. In Fortran, use ACTION='read' instead of the default ACTION='readwrite'.
Search for "Lustre Best Practices": some sites have good documentation available online (NASA, NICS…)
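A minimal sketch of the corrected script: the output file is opened only once, by redirecting the whole loop, instead of re-opening my_file.out at every iteration (the '…' loop condition is left elided as in the original):

while …
do
    echo 'bla'
done >> my_file.out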
Tiered storage: recently used data stays on the fast tier, unused data migrates down to HPSS disks and then HPSS tapes. Going down the tiers trades performance for capacity and a lower cost per GB (€€€ → €€ → €). New data lands on the top tier; access to an old file brings it back up.
The filesystem is saved in the HSM: recovery is possible in case of a crash, a major hardware failure, or the filesystem being reformatted.
Released files are still visible in store with their original size, but their contents have been moved out of store and are kept in HPSS. This is fully transparent to the end-user, and the space freed in store is available for new files.
Reading such a file is also transparent to the end-user: the first I/O call blocks until the stage operation is completed.
HSM lifecycle of a file:
new: the file has just been created and has no HPSS copy yet
archived/synchro: the file has been copied to HPSS and both copies match
modified/dirty: the file has been modified since the last HPSS copy; a new HPSS copy brings it back to archived/synchro
released: the disk space has been freed after an HPSS copy; a stage operation copies the data back on access
Retrieving data from tapes can be long: the tape has to be mounted and read, with seek and rewind times. It is advised to preload data before submitting a job that reuses or post-processes an old computation: preloading avoids wasting compute time. Use "ccc_hsm get" to preload 'released' files, as in the sketch below.
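A minimal sketch of preloading old results before submitting a post-processing job (the file path is hypothetical, and the 'ccc_hsm status' sub-command is an assumption to be checked against the on-site documentation; only 'ccc_hsm get' is taken from this presentation):

# check whether the old result is still on disk or has been released to tape
# ('status' sub-command assumed)
ccc_hsm status $CCCSTOREDIR/run_2017/results.tar

# trigger the copy back from HPSS now, before the job starts
ccc_hsm get $CCCSTOREDIR/run_2017/results.tar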
By default, 'du' displays the space used on disk, i.e. only at the Lustre level:
du -sh $CCCSTOREDIR
2T    (?!)
To get the total usage for both Lustre and HPSS, use the '-b' (apparent size) option:
du -bsh $CCCSTOREDIR
224T  ☺
The time to reload each file from tape is significant:
Time to move & load the tape in a tape drive, time to rewind the tape…
Packing data into bigger files reduces the time needed to read data back from tapes.
Example: reading the same amount of data (100 GB) from tapes:
                                      10 files x 10 GB   1000 files x 100 MB
Time to read back from tapes          A few minutes      Several hours
Overall system usage (tape drives)    Partial            Full
tar cf output.tar source_directory
Tar files follow a well-known standard (see libarchive, for example) and preserve permissions and owners/groups.
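A minimal sketch of packing a finished run into one large tar file before moving it to store, and unpacking it later (the directory and file names are hypothetical):

# pack the run directory into a single large file on the store filesystem
tar cf $CCCSTOREDIR/run_2017.tar run_2017/

# later: preload it from HPSS if it has been released, then unpack it
# (the -p option restores the original permissions)
ccc_hsm get $CCCSTOREDIR/run_2017.tar
tar xpf $CCCSTOREDIR/run_2017.tar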
CEA/DAM/DIF Commissariat à l’énergie atomique et aux énergies alternatives Centre de Saclay | 91191 Gif-sur-Yvette Cedex
Public industrial and commercial establishment | RCS Paris B 775 685 019