Programming Overview of Parallel IO 1 ARCHER Training Courses - PowerPoint PPT Presentation

Advanced Parallel Programming Overview of Parallel IO 1

ARCHER Training Courses Sponsors

Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_US This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: You must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original. Note that this presentation contains images owned by others. Please seek their permission before reusing these images. 3

Overview • Lecture will cover - Why is IO difficult - Why is parallel IO even worse - Straightforward solutions in parallel - What is parallel IO trying to achieve? - Files as arrays - MPI-IO and derived data types 4

Why is IO hard? • Breaks out of the nice process/memory model - data in memory has to physically appear on an external device • Files are very restrictive - linear access probably implies remapping of program data - just a string of bytes with no memory of their meaning • Many, many system-specific options to IO calls • Different formats - text, binary, big/little endian, Fortran unformatted, ... • Disk systems are very complicated - RAID disks, many layers of caching on disk, in memory, ... • IO is the HPC equivalent of printing! 5

Why is Parallel IO Harder? • Cannot have multiple processes writing a single file - Unix generally cannot cope with this - data cached in units of disk blocks (eg 4K) and is not coherent - not even sufficient to have processes writing to distinct parts of file • Even reading can be difficult - 1024 processes opening a file can overload the filesystem (fs) • Data is distributed across different processes - processes do not in general own contiguous chunks of the file - cannot easily do linear writes - local data may have halos to be stripped off 6

Simultaneous Access to Files Process 0 Disk cache File Disk block 0 Disk block 1 Disk block 2 Disk cache Process 1 7

Parallel File Systems • Parallel computer - constructed of many processors - each processor not particularly fast - performance comes from using many processors at once - requires distribution of data and calculation across processors • Parallel file systems - constructed from many standard disk - performance comes from reading / writing to many disks - requires many clients to read / write to different disks at once - data from a single file must be striped across many disks • Must appear as a single file system to user - typically have a single MedaData Server (MDS) - can become a bottleneck for performance 8

Parallel File Systems: Lustre 9

Lustre data striping Lustre’s performance comes from striping files over multiple OSTs Single logical user OS/file-system file e.g. automatically Stripes read/written to/from /work/y02/y02/ted divides the file into their assigned OST stripes 10

Parallel File Systems • Allow multiple IO processes to access same file – increases bandwidth • Typically optimised for bandwidth – not for latency – e.g. reading/writing small amounts of data is very inefficient • Very difficult for general user to configure and use – need some kind of higher level abstraction – allow user to focus on data layout across user processes – don’t want to worry about how file is split across IO servers 11

4x4 array on 2x2 Process Grid 2 4 2 4 Parallel Data 1 3 1 3 2 4 2 4 1 3 1 3 File 1 2 1 2 3 4 3 4 1 2 1 2 3 4 3 4 12

Shared Memory • Easy to solve in shared memory - imagine a shared array called x begin serial region open the file write x to the file close the file end serial region • Simple as every thread can access shared data - may not be efficient but it works • But what about message-passing? 13

Message Passing: Naive Solutions • Master IO - send all data to/from master and write/read a single file - quickly run out of memory on the master • or have to write in many small chunks - does not benefit from a parallel fs that supports multiple write streams • Separate files - each process writes to a local fs and user copies back to home - or each process opens a unique file (dataXXX.dat) on shared fs • Major problem with separate files is reassembling data - file contents dependent on number of CPUs and decomposition - pre / post-processing steps needed to change number of processes - but at least this approach means that reads and writes are in parallel • But may overload filesystem for many processes - e.g. MDS cannot keep up with requests 14

2x2 to 1x4 Redistribution data4.dat 11 12 15 16 4 8 12 16 data3.dat 9 10 13 14 3 7 11 15 write data2.dat 2 6 10 14 3 4 7 8 1 5 9 13 data1.dat 1 2 5 6 reorder 4 8 12 16 newdata4.dat newdata3.dat 3 7 11 15 read newdata2.dat 2 6 10 14 newdata1.dat 1 5 9 13 15

What do we Need? • A way to do parallel IO properly - where the IO system deals with all the system specifics • Want a single file format - We already have one: the serial format • All files should have same format as a serial file - entries stored according to position in global array • not dependent on which process owns them - order should always be 1, 2, 3, 4, ...., 15, 16 16

Information on Machine • What does the IO system need to know about the parallel machine? - all the system-specific fs details - block sizes, number of IO servers, etc. • All this detail should be hidden from the user - but the user may still wish to pass system- specific options … 17

ARCHER’s Cray Sonexion Storage MMU: Metadata Management Unit Lustre MetaData Server Contains server hardware and storage ● SSU: Scalable Storage Unit 2 x OSSs and 8 x OSTs (Object Storage Targets) Contains Storage controller, Lustre server, disk controller - Multiple SSUs are combined to form and RAID engine storage racks Each unit is 2 OSSs each with 4 OSTs of 10 (8+2) disks in a - RAID6 array 1 8

ARCHER’s File systems Connected to the Cray XC30 via LNET router service Infiniband nodes. Network /fs4 /fs3 /fs2 6 SSUs 7 SSUs 6 SSUs 12 OSSs 14 OSSs 12 OSSs 48 OSTs 56 OSTs 48 OSTs 480 HDDs 560 HDDs 480 HDDs 4TB per 4TB per 4TB per HDD HDD HDD 1.4 PB Total 1.6 PB Total 1.4 PB Total 19

Information on Data Layout • What does the IO system need to know about the data? - how the local arrays should be stitched together to form the file • But ... - mapping from local data to the global file is only in the mind of the programmer! - the program does not know that we imagine the processes to be arranged in a 2D grid • How do we describe data layout to the IO system - without introducing a whole new concept to MPI? - cartesian topologies are not sufficient • do not distinguish between block and block-cyclic decompositions 20

Programmer View vs Machine View 4 8 12 16 1 2 3 4 1 2 3 4 Process 2 3 7 11 15 Process 4 2 6 10 14 1 2 3 4 1 5 9 13 Process 3 1 2 3 4 Process 1 21

Files vs Arrays • Think of the file as a large array - forget that IO actually goes to disk - imagine we are recreating a single large array on a master process • The IO system must create this array and save to disk - without running out of memory • never actually creating the entire array • ie without doing naive master IO - and by doing a small number of large IO operations • merge data to write large contiguous sections at a time - utilising any parallel features • doing multiple simultaneous writes if there are multiple IO nodes • managing any coherency issues re file blocks 22

MPI-IO Approach • MPI-IO is part of the MPI standard - http://www.mpi-forum.org/docs/docs.html • Each process needs to describe what subsection of the global array it holds - it is entirely up to the programmer to ensure that these do not overlap for write operations! • Programmer needs to be able to pass system-specific information - pass an info object to all calls 23

Data Sections 4 8 12 16 4 8 12 16 3 7 11 15 3 7 11 15 on process 3 2 6 10 14 2 6 10 14 1 5 9 13 1 5 9 13 • Describe 2x2 subsection of 4x4 array • Using standard MPI derived datatypes • A number of different ways to do this - we will cover three methods in the course 24

Summary • Parallel IO is difficult - in theory and in practice • MPI-IO provides a higher-level abstraction - user describes global data layout using derived datatypes - MPI- IO hides all the system specific fs details … - … but (hopefully) takes advantage of them for performance • More flexible formats like NetCDF and HDF5 exist - they gain performance by layering on top of MPI-IO • User requires a good understanding of derived datatypes - see next lecture 25

Programming Overview of Parallel IO 1 ARCHER Training Courses - PowerPoint PPT Presentation

Advanced Parallel Programming Overview of Parallel IO 1 ARCHER Training Courses Sponsors Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

voice Kate Howland End-user programming? End-user programming? End-user programming?

Hierarchy of Software Complexity Application Programs Sequential Programming Embedded

Programming Styles and Objects Fermilab - TARGET 2018 Week 3 Programming styles Imperative

+ f(x) = Python Functional Programming Python Functional Programming Functional Programming by

NLP Programming Tutorial 0 - Programming Basics Graham Neubig Nara Institute of Science and

CS2281: Programming in UNIX Semester 3, 2004/05 CS2281: Programming in UNIX p.1/13 Syllabus

61A Lecture 26 Announcements Programming Languages Programming Languages 4 Programming

? P12 2 Getting Started/Lab Programming Lab Programming Program of Requirements PRELIMINARY

Introduction to Functional Programming in Python David Jones drj@ravenbrook.com Programming:

GPU programming in Haskell Henning Thielemann 2015-01-23 GPU programming in Haskell Motivation:

Programming Distributed Systems Programming Models for Distributed Systems Annette Bieniusa FB

MATHEMATICS 1 CONTENTS Mathematical programming Linear programming The LP-problem Old exam

Network Programming Network Programming as Programming across Machine Boundaries The

CS 4110 Probabilistic Programming Probabilistic Programming It's not about writing software.

Functional Programming in 40 minutes @russolsen Functional Programming in 40 minutes

Combining Combining Constraint Programming Constraint Programming and Integer Programming and

Memory Hierarchy (Performance OpAmizaAon) 2 Lab Schedule

15-721 DATABASE SYSTEMS [Image Source] Lecture #02 In-Memory Databases Andy Pavlo / /

MIDAS: An Execution-Driven Simulator for Active Storage Architectures Shahrukh R. Tarapore

Chapter 14: Mass-Storage Systems ! Disk Structure ! Disk Scheduling ! Disk Management ! Swap-Space

Tree Cache Learning Or, What I Did this Summer Jack Weinstein Argonne National Laboratories

Changelog Changes made in this version not seen in fjrst lecture: 19 March 2019: tmeporarily

The UNIX Time- Sharing System Dennis M. Ritchie and Ken Thompson Bell Laboratories

Openshift di Massimiliano Dessi Openshift Massimiliano Dessi The linuxday 2014 in

Sambuz

Useful Links

Newsletter

Mail Us