Data Management: Parallel Filesystems
Dr David Henty, HPC Training and Support
d.henty@epcc.ed.ac.uk, +44 131 650 5960
Overview
Lecture will cover:
– Why is IO difficult
– Why is parallel IO even worse
– Lustre
– GPFS
– Performance on ARCHER (Lustre)
Why is IO hard?
– IO is a real physical process: data in memory has to physically appear on an external device
– Files are not like memory: linear access probably implies remapping of program data, and a file is just a string of bytes with no memory of their meaning
– Many competing formats: text, binary, big/little endian, Fortran unformatted, ... (see the sketch below)
– Complicated hardware: RAID disks, many layers of caching on disk, in memory, ...
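As a minimal illustration of the formats point (file names made up), the same integer written from C as text and as raw binary produces completely different bytes on disk, and the binary bytes depend on the endianness of the machine:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t x = 0x12345678;          /* decimal 305419896 */

        /* text: the 9 ASCII bytes "305419896" - portable but bulky */
        FILE *ft = fopen("x.txt", "w");
        fprintf(ft, "%u", x);
        fclose(ft);

        /* raw binary: 4 bytes whose order depends on the machine;
           a little-endian CPU writes 78 56 34 12, a big-endian one
           12 34 56 78, and the file carries no record of which */
        FILE *fb = fopen("x.bin", "wb");
        fwrite(&x, sizeof x, 1, fb);
        fclose(fb);

        /* Fortran unformatted IO would add record markers on top */
        return 0;
    }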
Why is Parallel IO Harder?
– Unix generally cannot cope with this: data is cached in units of disk blocks (e.g. 4K) and the caches are not coherent between clients, so it is not even sufficient to have processes write to distinct parts of the file
– 1024 processes opening a file at once can overload the filesystem (fs)
– Processes do not in general own contiguous chunks of the file, so they cannot easily do linear writes, and local data may have halos to be stripped off first (the naive approach is sketched below)
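To make this concrete, here is a sketch of the naive POSIX approach the points above warn about: every process opens the same file and writes its own disjoint byte range (file name and sizes are illustrative):

    #include <mpi.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define NLOCAL 100000   /* illustrative; deliberately not a multiple of 4K */

    int main(int argc, char **argv)
    {
        static char data[NLOCAL];   /* in reality, program data with halos
                                       that must be stripped off first */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 1024 simultaneous opens of one file can overload the fs */
        int fd = open("out.dat", O_WRONLY | O_CREAT, 0644);

        /* the byte ranges are disjoint, but neighbouring ranges share
           4K disk blocks, and client-side caches are not coherent */
        pwrite(fd, data, NLOCAL, (off_t)rank * NLOCAL);

        close(fd);
        MPI_Finalize();
        return 0;
    }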
Simultaneous Access to Files
[Figure: Process 0 and Process 1, each with its own disk cache, accessing disk blocks 0, 1 and 2 of the same file]
Parallel File Systems
– A parallel computer is constructed of many processors: each processor is not particularly fast, and performance comes from using many at once, which requires distributing data and calculation across the processors
– A parallel filesystem is built the same way, from many standard disks: performance comes from reading and writing to many disks at once, which requires many clients accessing different disks simultaneously, so data from a single file must be striped across many disks
– Parallel filesystems typically have a single MetaData Server (MDS), which can become a bottleneck for performance
Disk interface bandwidths:

Interface                      Bandwidth (MB/s)
PATA (IDE)                     133
SATA                           600
Serial Attached SCSI (SAS)     600
Fibre Channel                  2,000
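As a rough illustration of why striping is unavoidable: even at the full 600 MB/s of a SATA link, sustaining 10 GB/s to a single file needs data spread over at least 10,000 / 600 ≈ 17 disks working in parallel, before allowing for any overheads.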
– Individual nodes may see a network-attached filesystem and local scratch disks
[Figure: several nodes (each with processors/cores and a disk) connected over a network to a network-attached filesystem]
– Home and work filesystems are optimised either for production IO or for user access
– Behaviour depends on the filesystem servers, caching, etc.
Parallel File Systems
– Striping across many disks increases bandwidth
– It does not help latency: e.g. reading or writing small amounts of data is very inefficient
– We need some kind of higher-level abstraction that allows the user to focus on the data layout across user processes, without worrying about how the file is split across IO servers (see the MPI-IO sketch below)
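MPI-IO is one such abstraction. In this sketch (all names and sizes illustrative), each process describes where its block of a global 3D array lives using a subarray type, and a collective write lets the library handle the mapping onto the filesystem:

    #include <mpi.h>

    #define NL 128                       /* illustrative local cube size */

    int main(int argc, char **argv)
    {
        int nproc, rank;
        int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0}, coords[3];
        int nglobal[3], nlocal[3] = {NL, NL, NL}, start[3];
        static double local[NL * NL * NL];   /* halos already stripped off */
        MPI_Comm cart;
        MPI_Datatype filetype;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* factor the processes into a 3D grid and find our place in it */
        MPI_Dims_create(nproc, 3, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 3, coords);

        for (int i = 0; i < 3; i++) {
            nglobal[i] = dims[i] * nlocal[i];
            start[i]   = coords[i] * nlocal[i];
        }

        /* the filetype says "my block lives here in the global file" */
        MPI_Type_create_subarray(3, nglobal, nlocal, start,
                                 MPI_ORDER_C, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(cart, "out.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native",
                          MPI_INFO_NULL);

        /* collective write: the library aggregates ranks' data into
           large, well-aligned requests to the parallel filesystem */
        MPI_File_write_all(fh, local, NL * NL * NL, MPI_DOUBLE,
                           MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        MPI_Finalize();
        return 0;
    }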
Parallel File Systems: Lustre
ARCHER’s Cray Sonexion Storage

– SSU: Scalable Storage Unit; MMU: Metadata Management Unit
– Each SSU contains the storage controller, Lustre server, disk controller and RAID engine
– Each SSU provides 2 OSSs and 8 OSTs (Object Storage Targets): each OSS has 4 OSTs of 10 (8+2) disks in a RAID6 array
– The MMU hosts the Lustre MetaData Server
– Multiple SSUs are combined to form storage racks
ARCHER’s File Systems

Filesystem   SSUs   OSSs   OSTs   HDDs   HDD size   Total
/fs2         6      12     48     480    4 TB       1.4 PB
/fs3         6      12     48     480    4 TB       1.4 PB
/fs4         7      14     56     560    4 TB       1.6 PB

All are connected to the Cray XC30 over an Infiniband network via LNET router service nodes.
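As a sanity check, the quoted capacities follow from the RAID6 (8+2) layout: for /fs2, 480 HDDs x 4 TB = 1920 TB raw, of which 8/10 is usable data space, i.e. 1536 TB, consistent with the quoted 1.4 PB.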
Lustre data striping
– A single logical user file, e.g. /work/y02/y02/ted
– The OS/filesystem automatically divides the file into stripes
– Stripes are then read/written to/from their assigned OST
– Lustre’s performance comes from striping files over multiple OSTs (see the sketch below)
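For illustration, with the simple round-robin layout Lustre uses, mapping a byte offset to an OST is just integer arithmetic (stripe size and count here are the ARCHER defaults described on the next slide):

    #include <stdio.h>

    #define STRIPE_SIZE  (1 << 20)   /* 1 MiB */
    #define STRIPE_COUNT 4

    int main(void)
    {
        long long offset = 5 * (long long)STRIPE_SIZE + 12345;

        long long stripe = offset / STRIPE_SIZE;     /* which stripe? */
        int ost_index    = stripe % STRIPE_COUNT;    /* which of the file's OSTs? */

        /* stripe 5 of a 4-OST file lands on the file's OST index 1 */
        printf("byte %lld -> stripe %lld on OST index %d\n",
               offset, stripe, ost_index);
        return 0;
    }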
Configuring Lustre
– The default is 4 stripes (i.e. a file is stored across 4 OSTs) in 1 MiB chunks
– Striping is under user control: easiest to set on a per-directory basis with lfs setstripe
– stripecount = 4 is the default
– stripecount = 1 is appropriate for many small files
– stripecount = -1 sets maximum striping (i.e. around 50 OSTs), appropriate for collective access to a single large file
– Striping can also be requested from code, as sketched below
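Besides lfs setstripe, an MPI program can request striping when it creates a file, via the ROMIO hints striping_factor and striping_unit, which ROMIO-based MPI-IO implementations honour on Lustre. A hedged sketch (file name and values illustrative):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* e.g. one stripe per OST on a 48-OST filesystem, 1 MiB stripes */
        MPI_Info_set(info, "striping_factor", "48");
        MPI_Info_set(info, "striping_unit", "1048576");

        /* striping is fixed at creation: the hints have no effect
           if bigfile.dat already exists */
        MPI_File_open(MPI_COMM_WORLD, "bigfile.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }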
Lustre on ARCHER
– ARCHER parallel IO white paper: …/papers/parallelIO/ARCHER_wp_parallelIO.pdf
GPFS

– Files are broken into blocks and striped over disks
– Distributed metadata (including the directory tree)
– Extended directory indexes
– Failure aware (partition based)
– Fully POSIX compliant

Storage pools:
– Group disks together
– Tiered on performance, reliability and locality
– Policies move and manage data: active management of data and its location
– Supports a wide range of storage hardware
GPFS deployment options:
– Shared disks (i.e. a SAN attached to the cluster)
– Network Shared Disks (NSD) using NSD servers
– NSD across clusters (acts as a higher-performance NFS)
Configuring GPFS
– MPI jobs were limited to a single node, and it is not clear what tuning can be done
– Performance seems to scale well with the number of processors: no equivalent of tuning Lustre striping is required
AFS

– A large/wide-scale distributed filesystem, similar in role to NFS
– Distributed and transparent to the user; designed for scalability
– Files are cached locally, and reads and writes are done on the local copy; servers maintain a list of open files (callback coherence)
– Local and shared files
– Does not support large databases or updating shared files
– Access control lists on directories for users and groups
HDFS
– A distributed filesystem with built-in fault tolerance
– Relaxed POSIX semantics to allow data streaming
– Optimised for large scale
– Separates data nodes from the metadata functionality: a single NameNode performs all filesystem namespace operations
– Similar decomposition to Lustre: the NameNode plays the role of the MDS
– The NameNode “RAIDs” data by replicating it across DataNodes, and copes with DataNode failures via heartbeat and status operations
Hierarchical storage management
[Figure: users see a single filesystem; beneath it, storage levels run from fast (SCSI RAID, SSD) through large (SATA RAID, disk) to long-term (optical disk, offsite storage, tape), with data staged between levels based on policies]
– Manages a small amount of expensive fast storage while maintaining the bulk of the data on slow, cheap storage
– Data migration between levels is driven by policies: time since last access, fixed times, or events
Cellular Automaton Model
Benchmark based on the cellular automaton simulation of Anton Shterenlikht (Proceedings of the 7th International Conference on PGAS Programming Models, 3-4 October 2013, Edinburgh, UK).
Benchmark
– Set up for weak scaling: fixed local arrays, e.g. 128 x 128 x 128, replicated across processes
– Implemented in Fortran using MPI-IO, HDF5 and NetCDF; a C sketch of the HDF5 approach follows below
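As an indication of what the HDF5 variant of such a benchmark looks like, here is a hedged C sketch (a 1D decomposition for brevity; file and dataset names illustrative) in which each rank writes its 128^3 block into a global dataset collectively:

    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        int rank, nproc;
        static double local[128 * 128 * 128];
        hsize_t nlocal[3] = {128, 128, 128};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* weak scaling: the global dataset grows with the process count */
        hsize_t nglobal[3] = {(hsize_t)nproc * 128, 128, 128};
        hsize_t start[3]   = {(hsize_t)rank * 128, 0, 0};

        /* open the file for parallel access via MPI-IO */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* global dataspace; this rank selects its own hyperslab */
        hid_t filespace = H5Screate_simple(3, nglobal, NULL);
        hid_t memspace  = H5Screate_simple(3, nlocal, NULL);
        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL,
                            nlocal, NULL);

        /* collective transfer, as in the MPI-IO sketch earlier */
        hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, local);

        H5Dclose(dset); H5Sclose(memspace); H5Sclose(filespace);
        H5Pclose(xfer); H5Pclose(fapl); H5Fclose(file);
        MPI_Finalize();
        return 0;
    }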
Parallel vs serial IO, default Lustre
Results on ARCHER