

SLIDE 1

STORAGE@TGCC & LUSTRE FILESYSTEMS

WORKING & BEST PRACTICES

PATC PARALLEL I/O APRIL 25TH, 2018

Philippe DENIEL | CEA/DAM/DIF

SLIDE 2

AGENDA
  • TGCC storage architecture
  • TGCC storage workspaces
  • Lustre parallel file system
  • Hierarchical Storage


SLIDE 3

TGCC ARCHITECTURE

[Architecture diagram]
  • Level 1: 1 PB of disks
  • Level 2: 30 PB of tapes
  • 7.5 PB file systems (work, store)
  • scratch

SLIDE 4

LUSTRE FILE SYSTEMS @ TGCC (1/2)

scratch

  • Workspace for temporary data
  • Mount point: /ccc/scratch ($CCCSCRATCHDIR)
  • Unused files deleted after 40 days
  • Designed for throughput and performance

store

  • Long-term storage: should be used to store final results
  • Connected to an HSM (see later slides) for bigger capacity
  • Recommended file size: 1GB-100GB
  • Quotas: 100k inodes per user, no quota on volume
  • Automated migration and staging with the HSM (see later slides)
  • Mount point: /ccc/store ($CCCSTOREDIR)
  • Designed for data capacity


SLIDE 5

LUSTRE FILE SYSTEMS @ TGCC (2/2)

work

  • Permanent workspace (no purge)
  • Accessible from all compute clusters
  • Quotas: 1TB and 500k inodes per user
  • Mount point: /ccc/work ($CCCWORKDIR)
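The three mount points are exposed through environment variables, so a job script can combine them according to their roles. A minimal sketch, assuming a typical batch job; the application and file names are hypothetical:

# Heavy I/O on scratch, inputs kept in work, packed results archived on store
RUN=$CCCSCRATCHDIR/myrun
mkdir -p "$RUN"
cp $CCCWORKDIR/myapp/input.dat "$RUN/"
cd "$RUN"
$HOME/bin/myapp input.dat                        # executable kept in the home (cf. best practices slide)
tar cf $CCCSTOREDIR/myrun_results.tar results/   # pack final results into one big file on store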


SLIDE 6

LUSTRE PARALLEL FILE SYSTEM


SLIDE 7

FROM LOCAL FILESYSTEMS … TO PARALLEL FILESYSTEMS


Local file system: 1 disk, 1 client
  • Examples: personal computer, /tmp of compute nodes

File server: 1 server, N clients
  • Example: login home
  • Interests: sharing, access from any workstation

Parallel file system: M servers, N clients
  • Example: scratch of a supercomputer
  • Interests: scalability, performance, fault tolerance


SLIDE 8

LUSTRE: A PARALLEL FILE SYSTEM


[Diagram: Lustre architecture]
  • The compute code sends metadata operations (create, mkdir, rename, ls, chmod, …) to the Metadata server (MDS), which holds the attributes: directories, file names, access rights, dates… Low extensibility; cost unit: inode.
  • File data is read from and written to the Data servers (OSS). Extensible; cost unit: volume.
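The inode/volume distinction can be checked directly on a mounted filesystem with the standard Lustre client tool; a hedged illustration (the actual output depends on the system):

lfs df -h /ccc/scratch     # space used per MDT and per OST: the "volume" cost unit
lfs df -i /ccc/scratch     # inodes used per MDT and per OST: the "inode" cost unit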

SLIDE 9

LUSTRE: HARDWARE REDUNDANCY

Hardware redundancy of Lustre filesystems


[Diagram: redundant hardware]
  • 1 metadata server + failover
  • Data servers (OSS) attached to disk arrays
  • RAID: 8 disks + 2 parity
  • RAID10: 2 disks + 2 mirrors

SLIDE 10

LUSTRE STRIPING

What is striping?

  • To increase data throughput, Lustre can parallelize file storage over several servers
  • Data is distributed across servers as blocks of "stripe size"


[Diagram: files striped over 1, 2, …, N servers. Example: stripe_count=4, stripe_size=1MB]
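With stripe_count=4 and stripe_size=1MB, consecutive 1 MB blocks of a file are placed round-robin on 4 data servers: bytes [0,1MB) on the first object, [1MB,2MB) on the second, and so on, wrapping back to the first object at 4 MB. The layout of an existing file can be inspected as follows (a hedged sketch; the file name is hypothetical):

lfs getstripe my_results.h5    # shows the stripe count, stripe size and the OSTs holding the objects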

SLIDE 11

STRIPING: RECOMMENDATIONS

What striping should be used?

  • Striping > 1 induces extra costs (N servers to communicate with) but results in increased bandwidth
  • Useless for small files (< a few MB)
  • Worthwhile for bigger files (~ gigabyte-sized)
  • If accessed from a single client: stripe_count = 2 is enough to get the maximum throughput
  • Increase the stripe count if many clients write large volumes of data to the same file

  • As much as possible, align writes with stripe_size

  • Mandatory for huge files (hundreds of GB): avoid having more than 500GB per server (e.g. a 3 TB file needs stripe_count ≥ 6)


SLIDE 12

SETTING STRIPE

How to set striping?

  • Set per directory; files and sub-directories inherit it when they are created
  • Only affects new file creation (not previously created files)
  • Command:

lfs setstripe -c <stripe_count> <directory>

Default stripe @ TGCC

  • 1 on scratch and work
  • 4 on store
  • 4 with MPI-IO
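A minimal sketch of setting and checking striping on a directory meant for large files; the directory name is hypothetical:

mkdir -p $CCCSCRATCHDIR/big_outputs
lfs setstripe -c 8 $CCCSCRATCHDIR/big_outputs   # files created here afterwards will use 8 stripes
lfs getstripe $CCCSCRATCHDIR/big_outputs        # verify the layout that new files will inherit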


SLIDE 13

LUSTRE: BEST PRACTICES

Best practices

  • Avoid using "ls -l" when "ls" is enough
  • Avoid having a huge number of files in a single directory (< 1000)
  • Avoid small files on Lustre filesystems
  • Use a stripe count of 1 for directories with many small files
  • Lustre filesystems are not backed up: keep critical data (e.g. source code) in your home
  • Limit the number of processes writing to the same file (locking contentions)
  • Avoid starting executables on Lustre (they run slower)
  • Avoid repetitive open/close operations. Example of a wrong script:

while … ; do
  echo 'bla' >> my_file.out    # opens and closes my_file.out at every iteration
done
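A better pattern keeps the file open for the whole loop instead of re-opening it at every iteration; a minimal sketch with the same placeholder condition:

while … ; do
  echo 'bla'
done >> my_file.out    # the redirection opens my_file.out once for the entire loop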

  • Open "read-only" when only reading a file, to reduce locking contentions
  • In Fortran, use ACTION='read' instead of the default ACTION='readwrite'

More details

Google "Lustre Best Practices": several sites have good documentation available online (NASA, NICS…)


SLIDE 14

Store: Hierarchical Storage Management


SLIDE 15

FOUNDATION OF HSM

Data "sedimentation"


[Diagram: data "sedimentation" — new data enters the top tier, unused data sinks toward tape, accessing an old file brings it back up]
  • store (Lustre disks): performance, highest cost/GB (€€€)
  • HPSS disks: recently used data, intermediate cost/GB (€€)
  • HPSS tapes: unused data, capacity, lowest cost/GB (€)

SLIDE 16

DATA MIGRATION

How HSM works

store is permanently watched by a policy engine (Robinhood):
  • Files eligible for migration are automatically copied to HPSS

The filesystem is thus saved in the HSM:
  • Recovery is possible in case of a crash, a major hardware failure, or the FS being reformatted

Older files are:
  • Still visible in store, with their original size
  • Their contents leave store and are kept in HPSS; this is fully transparent to the end-user
  • The space freed in store is available for new files

Freed files are staged back at first access:
  • Transparent to the end-user
  • The first I/O call is blocked until the stage operation completes


SLIDE 17

A FILE’S LIFE


[Diagram: life cycle of a file on store]
  • Creation → state "new"
  • Copy to HPSS → state "archived/synchro" (online)
  • Modification → state "modified/dirty", until a new HPSS copy is made
  • Disk space is freed → state "released" (offline)
  • A stage operation brings the file back online
SLIDE 18

FILES STATUS


SLIDE 19

USER INTERFACE

Users’ view:

  • Users access their data via a standardized path: /ccc/store/contxxx/grp/usr ($STOREDIR)
  • No direct access to HPSS: it is "hidden" behind store
  • Regular commands apply to store
  • Accessing a released file stages it back to Lustre; data access is blocked until the transfer is completed

ccc_hsm command:
  • ccc_hsm status: query file status (online, released, …)
  • ccc_hsm get: prefetch files
  • ccc_hsm ls: like "ls", but also shows the HSM status (online, offline)
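A hedged example session with these subcommands; the directory and file names are hypothetical:

ccc_hsm ls $CCCSTOREDIR/results                  # "ls" output plus the HSM state of each entry
ccc_hsm status $CCCSTOREDIR/results/run42.tar    # reports e.g. online or released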


SLIDE 20

CCC_HSM GET

Preloading data

  • Retrieving data from tapes can be long: magnetic tapes have to be mounted and read
  • It is advised to preload data before submitting a job (e.g. to reuse or post-process an old computation)
  • Preloading data avoids wasting compute time
  • Use "ccc_hsm get" to preload 'released' files
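A minimal pre-job sketch, assuming the outputs of an older run must be post-processed; the path is hypothetical:

ccc_hsm get $CCCSTOREDIR/old_run/outputs.tar   # trigger the stage from tape now, before the job runs
# submit the post-processing job once the data is back online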


SLIDE 21

‘DU’ ON STORE

What does ‘du’ display on /ccc/store?

By default, ‘du’ displays space used on disk, i.e. only on the Lustre level:

du -sh $CCCSTOREDIR
2T      (?!)

If you want the total usage for both Lustre and HPSS, use the ‘-b’ option:

du -bsh $CCCSTOREDIR
224T    ☺
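With GNU du, ‘-b’ is equivalent to ‘--apparent-size --block-size=1’: it counts the logical file sizes rather than the blocks currently allocated in Lustre, which is why released files are included. The same human-readable total can be obtained with:

du -sh --apparent-size $CCCSTOREDIR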


SLIDE 22

BEST PRACTICE (STORE): PACKING DATA

Pack your data into big files

The time to reload each file from tape is significant:

Time to move & load the tape in a tape drive, time to rewind the tape…

Packing data into bigger files reduces the time needed to read data back from tapes

Example: reading the same amount of data (100GB) back from tapes (see the table below)

Recommended file size: 1GB to 500GB


                                     | 10 files x 10GB | 1000 files x 100MB
Time to read back from tapes         | A few minutes   | Several hours
Overall system usage (tape drives)   | Partial         | Full

SLIDE 23

TAR IS YOUR FRIEND

TAR is dangerous only in cigarettes

Using the “tar” command is an easy way of packing files:

tar cf output.tar source_directory

Tools exist to access tarballs from within software:
  • Tar files follow a well-known standard (see libarchive, for example)

TAR preserves metadata:
  • Permissions
  • Owners/groups
TAR preserves symlinks
TAR archives can be appended to

Alternative: you can use cpio if you prefer ;-)
Designing an I/O framework for your simulation code is never a bad idea.
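A few more tar operations that follow from the points above; a hedged sketch, the file names are hypothetical:

tar cf results.tar results_dir/        # pack a directory into a single tar file
tar tf results.tar                     # list the contents without extracting
tar rf results.tar extra_file.dat      # append a file (tar archives can be appended to)
tar xf results.tar -C $CCCSCRATCHDIR   # extract on scratch before post-processing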


SLIDE 24

KEEP IN MIND WHAT THE RESOURCES ARE MADE FOR

  • STOREDIR = LONG-TERM & CAPACITY
  • WORKDIR = WORKING & SHARING
  • SCRATCH = TEMPORARY & PERFORMANCE


SLIDE 25

Thanks for your attention

Questions?


SLIDE 26

CEA/DAM/DIF Commissariat à l’énergie atomique et aux énergies alternatives Centre de Saclay | 91191 Gif-sur-Yvette Cedex

Etablissement public à caractère industriel et commercial | RCS Paris B 775 685 019
