SLIDE 1

nci.org.au

Parallel IO

These slides are possible thanks to these sources – Jonathan Dursi, SciNet Toronto Parallel I/O Tutorial; Argonne National Laboratory, HPC I/O for Computational Scientists; TACC/Cornell MPI-IO Tutorial; NERSC Lustre notes; Quincey Koziol, The HDF Group

slide-2
SLIDE 2

nci.org.au

References

  • eBook: High Performance Parallel I/O – Chapter 8: Lustre; Chapter 13: MPI-IO; Chapter 15: HDF5
  • HPC I/O for Computational Scientists (YouTube video and slides)
  • Parallel I/O Basics (paper)
  • eBook: Memory Systems – Cache, DRAM, Disk, by Bruce Jacob

slide-3
SLIDE 3

nci.org.au

The Advent of Big Data

  • Big Data refers to datasets and flows so large that they have outpaced our capability to store, process, analyse and understand them
    – Increases in computing power make simulations larger and more frequent
    – Increases in sensor resolution create larger observational data points
  • Data sizes that were once measured in MBs or GBs are now measured in TBs or PBs
  • It is easier to generate the data than to store it
slide-4
SLIDE 4

nci.org.au

The Four V’s

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

slide-5
SLIDE 5

nci.org.au

BIG DATA PROJECTS AT THE NCI

slide-6
SLIDE 6

nci.org.au

ESA’s Sentinel Constellation

We care for a safer world
  • The Sentinel-1 systematic observation scenario, with one/two main high-rate modes of operation, will result in significantly large acquisition segments (data takes of a few minutes)
  • 25 min in high-rate modes leads to about 2.4 TB of compressed raw data per day for the 2 satellites
  • Wave Mode is operated continuously over ocean where the high-rate modes are not used

[Figure: Sentinel-1 observation scenario and impact on data volumes – IW data takes of 2, 6 and 15 minutes, producing roughly 4–12 GB for GRD-HR and 16–46 GB for SLC products]

slide-7
SLIDE 7

nci.org.au

ESA’s Sentinel Constellation

We care for a safer world
  • The Sentinel-1 systematic observation scenario, with one/two main high-rate modes of operation, will result in significantly large acquisition segments (data takes of a few minutes)
  • 25 min in high-rate modes leads to about 2.4 TB of compressed raw data per day for the 2 satellites
  • Wave Mode is operated continuously over ocean where the high-rate modes are not used

[Figure: Sentinel-1 observation scenario and impact on data volumes – IW data takes of 2, 6 and 15 minutes, producing roughly 4–12 GB for GRD-HR and 16–46 GB for SLC products]

Sentinel-1s: 2.4 TB/day; Sentinel-2s: 1.6 TB/day (high-resolution optical land monitoring); Sentinel-3s: 0.6 TB/day (land + marine observation)

slide-8
SLIDE 8

nci.org.au

Nepal Earthquake Interferogram using Sentinel SAR Data

slide-9
SLIDE 9

nci.org.au

Data Storage at NCI

slide-10
SLIDE 10

nci.org.au

Data Storage Subsystems at the NCI

  • The NCI compute and data environments allow researchers to work seamlessly with HPC and cloud-based compute cycles while having unified data storage
  • How is this done?
slide-11
SLIDE 11

nci.org.au

NCI’s integrated high-performance environment

[Diagram: NCI's integrated high-performance environment]
  • /g/data persistent global parallel filesystems on a 56 Gb FDR IB fabric: /g/data1 ~6.3 PB, /g/data2 ~3.1 PB
  • Raijin high-speed filesystems: /short 7.6 PB; /home, /system, /images, /apps
  • Massdata tape archive: 1.0 PB cache, 12.3 PB tape
  • Raijin HPC compute, Raijin login + data movers, VMware cloud and NCI data movers, connected by the Raijin 56 Gb FDR IB fabric, 10 GigE, a link to the Huxley DC, and the Internet

slide-12
SLIDE 12

nci.org.au

HARDWARE TRENDS

slide-13
SLIDE 13

nci.org.au

Disk and CPU Performance

[Chart (HPCS2012): disk bandwidth (MB/s) vs. CPU performance (MIPS) over time]

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-14
SLIDE 14

nci.org.au

Disk and CPU Performance

[Chart (HPCS2012): disk bandwidth (MB/s) vs. CPU performance (MIPS) over time – the gap has grown to roughly 1000x]

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-15
SLIDE 15

nci.org.au

Memory and Storage Latency

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-16
SLIDE 16

nci.org.au

Assessing Storage Performance

  • Data rate – MB/sec
    – Peak or sustained
    – Writes are faster than reads
  • IOPS – I/O operations per second
    – open(), close(), seek(), read(), write()

slide-17
SLIDE 17

nci.org.au

Assessing Storage Performance

  • Data rate – MB/sec
    – Peak or sustained
    – Writes are faster than reads
  • IOPS – I/O operations per second
    – open(), close(), seek(), read(), write()

Lab – measuring MB/s and IOPS (see the sketch below)
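Not part of the deck's lab tooling (FIO and ioping come later), but as a minimal illustration of what "effective MB/s and IOPS" means, here is a sketch in C that times a loop of open/write/close calls; file names and sizes are arbitrary choices. Note that without fsync() or O_DIRECT the page cache absorbs the writes, which is one reason the lab uses purpose-built tools instead.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Time N small open/write/close cycles and report effective MB/s and IOPS. */
int main(void)
{
    const int nfiles = 1000;
    const size_t blk = 1024;               /* 1 kB per file, as in the HDD example */
    char buf[1024];
    memset(buf, 'x', sizeof(buf));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < nfiles; i++) {
        char name[64];
        snprintf(name, sizeof(name), "iotest.%04d", i);
        int fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, buf, blk) != (ssize_t)blk) { perror("write"); return 1; }
        close(fd);                          /* 3 I/O operations per file: open, write, close */
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb   = nfiles * blk / 1e6;
    printf("wrote %.2f MB in %.2f s: %.3f MB/s, %.0f IOPS\n",
           mb, secs, mb / secs, 3.0 * nfiles / secs);
    return 0;
}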

slide-18
SLIDE 18

nci.org.au

Storage Performance

  • Data rate – MB/sec
    – Peak or sustained
    – Writes are faster than reads
  • IOPS – I/O operations per second
    – open(), close(), seek(), read(), write()

  Device     Bandwidth (MB/s)   IOPS
  SATA HDD   100                100
  SSD        250                10,000

HDD:
  Open, write, close 1000 x 1 kB files: 30.01 s (effective 0.033 MB/s)
  Open, write, close 1 x 1 MB file: 40 ms (effective 25 MB/s)

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-19
SLIDE 19

nci.org.au

Storage Performance

  • Data rate – MB/sec
    – Peak or sustained
    – Writes are faster than reads
  • IOPS – I/O operations per second
    – open(), close(), seek(), read(), write()

  Device     Bandwidth (MB/s)   IOPS
  SATA HDD   100                100
  SSD        250                10,000

SSD:
  Open, write, close 1000 x 1 kB files: 300 ms (effective 3.3 MB/s)
  Open, write, close 1 x 1 MB file: 4 ms (effective 232 MB/s)

SSDs are better at IOPS – no moving parts; the remaining latency is in the controller, system calls, etc. SSDs are still very expensive, so disk is here to stay!

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-20
SLIDE 20

nci.org.au

Storage Performance

  • Data rate – MB/sec
    – Peak or sustained
    – Writes are faster than reads
  • IOPS – I/O operations per second
    – open(), close(), seek(), read(), write()

  Device     Bandwidth (MB/s)   IOPS
  SATA HDD   100                100
  SSD        250                10,000

SSD:
  Open, write, close 1000 x 1 kB files: 300 ms (effective 3.3 MB/s)
  Open, write, close 1 x 1 MB file: 4 ms (effective 232 MB/s)

SSDs are better at IOPS – no moving parts; the remaining latency is in the controller, system calls, etc. SSDs are still very expensive, so disk is here to stay!

Raijin /short – aggregate 150 GB/sec (writes), 120 GB/sec (reads). The 5 DDN SFA12K arrays behind /short are each capable of 1.3M read IOPS and 700,000 write IOPS, yielding a total of 6.5M read IOPS and 3.5M write IOPS.

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-21
SLIDE 21

nci.org.au

[Diagram: The Linux Storage Stack Diagram, version 4.10 (2017-03-10) – outlines the Linux storage stack as of kernel 4.10. Applications issue system calls (open(2), read(2), write(2), stat(2), chmod(2), mmap, or direct I/O with O_DIRECT) through the VFS to block-based filesystems (ext2/3/4, btrfs, XFS, iso9660, ...), network filesystems (NFS, SMB, ceph, ...), pseudo and special-purpose filesystems (proc, sysfs, tmpfs, ...) and stackable filesystems (ecryptfs, overlayfs, FUSE). I/O passes through the page cache into the block layer (BIOs, blk-mq with software and hardware dispatch queues, I/O schedulers such as deadline, cfq and noop), optionally via stackable block devices (LVM, mdraid, drbd, device mapper targets such as dm-crypt/dm-mirror/dm-thin/dm-cache, bcache), then through the SCSI mid layer, transport classes and low-level drivers (or NVMe, virtio and other block drivers) down to the physical devices: HDD, SSD, DVD drive, PCIe flash cards, RAID controllers, and Fibre Channel/SAS HBAs. The LIO target stack exports storage over iSCSI, Fibre Channel, FireWire, USB, etc.]
slide-22
SLIDE 22

nci.org.au

[Diagram (detail): upper layers of the Linux Storage Stack Diagram, version 4.10 (2017-03-10) – applications, system calls (open(2), read(2), write(2), stat(2), chmod(2), mmap, direct I/O with O_DIRECT), the VFS and page cache, and the filesystem families: block-based (ext2/3/4, btrfs, XFS, iso9660, ...), network (NFS, coda, smbfs, gfs, ocfs, ceph, ...), pseudo (proc, sysfs, futexfs, usbfs), special-purpose (tmpfs, ramfs, devtmpfs, pipefs) and stackable (ecryptfs, overlayfs, unionfs, FUSE/userspace filesystems such as sshfs), sitting above stackable block devices (LVM, mdraid, drbd, dm-crypt, dm-mirror, dm-thin, dm-cache, bcache) and the LIO target stack.]
slide-23
SLIDE 23

nci.org.au

[Diagram (detail): the block layer of the Linux Storage Stack Diagram – BIOs from the filesystems and from stackable devices (LVM, mdraid, drbd, device mapper targets such as dm-crypt, dm-mirror, dm-thin, dm-cache, dm-raid, dm-delay, bcache) enter the multi-queue block layer (blk-mq) with per-CPU software queues and hardware dispatch queues; I/O schedulers (deadline, cfq, noop) map BIOs to requests, which are handed to request-based or BIO-based drivers.]
slide-24
SLIDE 24

nci.org.au

[Diagram (detail): lower layers of the Linux Storage Stack Diagram – request-based device mapper targets (dm-multipath), the SCSI mid layer and upper-level drivers (/dev/sd*, scsi-mq), transport classes (scsi_transport_fc, scsi_transport_sas, ...), SCSI low-level drivers (megaraid_sas, aacraid, qla2xxx, lpfc, mpt3sas, libata/ahci/ata_piix, vmw_pvscsi, virtio_scsi, ...), other block drivers (nvme, virtio_blk, mtip32xx, rsxx, skd, null_blk, mmc, rbd, zram, nbd, loop, ...), and the physical devices: HDD, SSD, DVD drive, Micron PCIe card, LSI/Adaptec RAID, Qlogic/Emulex HBAs, SD/MMC cards and IBM flash adapters.]

slide-25
SLIDE 25

nci.org.au

slide-26
SLIDE 26

nci.org.au

HPC IO

slide-27
SLIDE 27

nci.org.au

Scientific I/O

  • I/O is commonly used by scientific applications to achieve goals like:
    – Storing numerical output from simulations for later analysis
    – Implementing 'out-of-core' techniques for algorithms that process more data than can fit in system memory and must page data in from disk
    – Checkpointing to files that save the state of an application in case of system failure
  • In most cases, scientific applications write large amounts of data in a structured or sequential 'append-only' way that does not overwrite previously written data or require random seeks throughout the file
    – That said, there are seeky workloads, such as graph traversal and bioinformatics problems
  • Most HPC systems are equipped with a parallel file system such as Lustre or GPFS that abstracts away spinning disks, RAID arrays and I/O subservers to present the user with the simplified view of a single address space for reading and writing files

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1

slide-28
SLIDE 28

nci.org.au

Data Access in Current Large-Scale Systems

Current systems have greater support on the logical side and more complexity on the physical side.

[Diagram: logical (data model) view of data access – application, data model library, I/O transform layer(s), files (POSIX); physical (hardware) view – compute node memory, internal system network(s), I/O gateways, external system network(s), I/O servers, SAN and RAID enclosures; data movement connects the two views.]

http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf

slide-29
SLIDE 29

nci.org.au

Common Methods for Accessing Data in Parallel

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1

slide-30
SLIDE 30

nci.org.au

Common Methods for Accessing Data in Parallel

  • Simplest to implement – each processor has its own file handle and works independently of the other nodes
  • Parallel file systems perform well on this type of IO, but it creates a metadata bottleneck
    – ls breaks, for example
  • Another downside is that program restarts now depend on getting the same processor layout

(A minimal sketch of this file-per-process pattern follows below.)

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
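A minimal sketch of the file-per-process pattern described above, assuming an MPI program where each rank dumps its own buffer to its own file (the file-name pattern and sizes are illustrative only):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* File-per-process pattern: every rank opens and writes its own file.
 * Simple and fast, but produces nprocs files and ties restarts to the
 * same process count. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1 << 20;                      /* 1 Mi doubles per rank */
    double *data = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) data[i] = rank + i * 1e-6;

    char fname[64];
    snprintf(fname, sizeof(fname), "output.%05d.bin", rank);  /* one file per rank */
    FILE *fp = fopen(fname, "wb");
    fwrite(data, sizeof(double), n, fp);
    fclose(fp);

    free(data);
    MPI_Finalize();
    return 0;
}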

slide-31
SLIDE 31

nci.org.au

Common Methods for Accessing Data in Parallel

  • Many processors share the same file handle but write to their own distinct sections of a shared file
  • If there are shared regions of the file, a lock manager is used to serialize access
    – For large O(N), the locking is an impediment to performance
    – Even in ideal cases, where the file system is guaranteed that processors are writing to exclusive regions, shared-file performance can be lower than file-per-processor
  • The advantage of shared-file access lies in data management and portability, especially when a higher-level I/O format such as HDF5 or netCDF is used to encapsulate the data in the file

(A minimal MPI-IO sketch of this shared-file pattern follows below.)

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
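A minimal sketch of the shared-file pattern, assuming each rank writes a contiguous block at an offset computed from its rank; this uses independent MPI-IO calls, so no collective coordination is involved (file name and sizes are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Shared-file pattern: all ranks open one file and write to disjoint
 * regions at offsets computed from their rank. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                      /* 1 Mi doubles per rank */
    double *data = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) data[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's block starts at rank * n * sizeof(double) bytes. */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at(fh, offset, data, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(data);
    MPI_Finalize();
    return 0;
}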

slide-32
SLIDE 32

nci.org.au

Common Methods for Accessing Data in Parallel

  • Collective buffering is a technique used to improve the performance of shared-file access by offloading some of the coordination work from the file system to the application
    – A subset of the processors is chosen to be the 'aggregators'
    – These collect data from the other processors and pack it into contiguous buffers in memory that are then written to the file system
  • Reducing the number of processors that interact with the I/O subservers reduces PFS contention
  • Collective buffering was originally developed to reduce the number of small, noncontiguous writes
    – Another benefit, important for file systems such as Lustre, is that the buffer size can be set to a multiple of the ideal transfer size preferred by the file system

(A hedged MPI-IO sketch of a collective write follows below.)

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
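A hedged sketch of how an application can hand this coordination to the MPI-IO layer: the collective write call enables collective buffering, and the ROMIO-style hints shown (romio_cb_write, cb_buffer_size, cb_nodes) steer it on implementations that support them; the hint values here are illustrative, not recommended settings.

#include <mpi.h>
#include <stdlib.h>

/* Collective shared-file write: MPI_File_write_at_all lets the MPI-IO layer
 * aggregate data onto a subset of ranks (the collective-buffering
 * aggregators) before it touches the file system. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *data = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) data[i] = rank;

    /* ROMIO hints steering collective buffering; defaults and support are
     * implementation- and system-dependent. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_buffer_size", "4194304");   /* 4 MB aggregation buffers */
    MPI_Info_set(info, "cb_nodes", "8");                /* 8 aggregator ranks */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "collective.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(data);
    MPI_Finalize();
    return 0;
}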

slide-33
SLIDE 33

nci.org.au

HPC IO – How it works

HPC I/O systems provide a file system view of stored data:
  – File (i.e., POSIX) model of access
  – Shared view of data across the system
  – Access to the same data from the outside (e.g., login nodes, data movers)

Topics:
  – How is data stored and organized?
  – What support is there for application data models?
  – How does data move from clients to servers?
  – How is concurrent access managed?
  – What transformations are typically applied?

A file system view consists of directories (a.k.a. folders) and files. Files are broken up into regions called extents or blocks.

http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf

slide-34
SLIDE 34

nci.org.au

Storing and Organizing Data: Storage Model

HPC I/O systems are built around a parallel file system that organizes storage and manages access.

Parallel file systems (PFSes) are distributed systems that provide a file data model (i.e., files and directories) to users. Multiple PFS servers manage access to storage, while PFS client systems run the applications that access storage. PFS clients can access storage resources in parallel!

http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf

slide-35
SLIDE 35

nci.org.au

Reading and Writing Data to a PFS

PFS servers manage local storage and service incoming requests from clients. PFS client software requests operations on behalf of applications. Requests are sent as messages (RPC-like), often to multiple servers. Requests pass over the interconnect, so each request incurs some latency. RAID enclosures protect against individual disk failures and map regions of data onto specific devices.

http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf

slide-36
SLIDE 36

nci.org.au

Data Distribution in Parallel File Systems

Distribution across multiple servers allows concurrent access.

http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf

slide-37
SLIDE 37

nci.org.au

Request Size and IO Rate

Interconnect latency has a significant impact on the effective rate of I/O. Typically, I/Os should be in the O(MBytes) range.

Tests run on 2K processes of IBM Blue Gene/P at ANL.

http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf

slide-38
SLIDE 38

nci.org.au

Request Size and IO Rate

Interconnect latency has a significant impact on the effective rate of I/O. Typically, I/Os should be in the O(MBytes) range.

Tests run on 2K processes of IBM Blue Gene/P at ANL.

Why are writes faster than reads?

http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf

slide-39
SLIDE 39

nci.org.au

Where and how you do I/O matters.

  • Binary – smaller files, much faster to read/write.
  • You're not going to read GB/TB of data yourself; don't bother trying.
  • Write in one chunk, rather than a few numbers at a time (see the sketch below).

Timing data: writing 128M double-precision numbers
  Large parallel file system:   ASCII 173 s   binary 6 s
  Ramdisk:                      ASCII 174 s   binary 1 s
  Typical workstation disk:     ASCII 260 s   binary 20 s

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
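A small C sketch of the "binary, one chunk" point above: the same array written as formatted ASCII one value at a time versus a single binary fwrite. Sizes are illustrative and smaller than the slide's 128M doubles; the timings on this slide come from Dursi's measurements, not from this sketch.

#include <stdio.h>
#include <stdlib.h>

/* Two ways to dump the same array: formatted ASCII one value at a time,
 * vs. a single binary fwrite of the whole buffer. The binary file is
 * smaller, and the single large write avoids per-value library/syscall
 * overhead, which is the effect the slide's timings illustrate. */
int main(void)
{
    const size_t n = 1 << 24;                 /* 16 Mi doubles (~128 MB) */
    double *x = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++) x[i] = (double)i;

    /* ASCII, one number per call: many small writes, a much larger file. */
    FILE *fa = fopen("data.txt", "w");
    for (size_t i = 0; i < n; i++) fprintf(fa, "%.17g\n", x[i]);
    fclose(fa);

    /* Binary, one call for the whole array: one large contiguous write. */
    FILE *fb = fopen("data.bin", "wb");
    fwrite(x, sizeof(double), n, fb);
    fclose(fb);

    free(x);
    return 0;
}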

slide-40
SLIDE 40

nci.org.au

Where and how you do I/O matters.

  • All disk systems do best when reading/writing large, contiguous chunks.
  • I/O operations (IOPS) are themselves expensive:
    – moving around within a file
    – opening/closing
  • Seeks take 3–15 ms – enough time to read 0.75 MB!

Timing data: reading 128M double-precision numbers (typical workstation disk)
  binary – one large read:             14 s
  binary – 8k at a time:               20 s
  binary – 8k chunks, lots of seeks:   150 s
  binary – seeky + opens and closes:   205 s

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-41
SLIDE 41

nci.org.au

Where and how you do I/O matters.

  • RAM is much better for random accesses.
  • Use the right storage medium for the job!
  • Where possible, read in large contiguous chunks and do the random access in memory.
  • Much better if you use most of the data you read in.

Timing data: writing 128M double-precision numbers
  Large parallel file system:   ASCII 173 s   binary 6 s
  Ramdisk:                      ASCII 174 s   binary 1 s
  Typical workstation disk:     ASCII 260 s   binary 20 s

Timing data: reading 128M double-precision numbers (ramdisk)
  binary – one large read:             1 s
  binary – 8k at a time:               1 s
  binary – 8k chunks, lots of seeks:   1 s
  binary – seeky + opens and closes:   1.5 s

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-42
SLIDE 42

nci.org.au

Where and how you do I/O matters.

  • Well-built parallel file systems can greatly increase bandwidth:
    – many pipes to the network (servers), many spinning disks (bandwidth off of the disks)
  • But there are typically even worse penalties for seeky/IOPSy operations (coordinating all those disks).
  • A parallel FS can help with big data in two ways.

Timing data: writing 128M double-precision numbers
  Large parallel file system:   ASCII 173 s   binary 6 s
  Ramdisk:                      ASCII 174 s   binary 1 s
  Typical workstation disk:     ASCII 260 s   binary 20 s

Timing data: reading 128M double-precision numbers (large parallel file system)
  binary – one large read:             7.5 s
  binary – 8k at a time:               62 s
  binary – 8k chunks, lots of seeks:   428 s
  binary – seeky + opens and closes:   2137 s

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-43
SLIDE 43

nci.org.au

Where and how you do I/O matters.

  • Well-built parallel file systems can greatly increase bandwidth:
    – many pipes to the network (servers), many spinning disks (bandwidth off of the disks)
  • But there are typically even worse penalties for seeky/IOPSy operations (coordinating all those disks).
  • A parallel FS can help with big data in two ways: striping, and multiple readers + writers.

Timing data: writing 128M double-precision numbers
  Large parallel file system:   ASCII 173 s   binary 6 s
  Ramdisk:                      ASCII 174 s   binary 1 s
  Typical workstation disk:     ASCII 260 s   binary 20 s

Timing data: reading 128M double-precision numbers (large parallel file system)
  binary – one large read:             7.5 s
  binary – 8k at a time:               62 s
  binary – 8k chunks, lots of seeks:   428 s
  binary – seeky + opens and closes:   2137 s

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-44
SLIDE 44

nci.org.au

LAB 1

slide-45
SLIDE 45

nci.org.au

Lab 1 - Measuring IO Performance

  • Specifically: measure IO performance (IOPS, streaming b/w and latency)
  • Objective 1: learn to use the FIO tool to measure IOPS and streaming bandwidth
  • Objective 2: learn to use the ioping tool to measure IO latency
  • Objective 3: data collection and analysis
slide-46
SLIDE 46

nci.org.au

Lab 1 – Background – Tools

  • What is FIO?
    fio is a tool that spawns a number of threads or processes doing a particular type of I/O action, as specified by the user.
  • What is ioping?
    A tool to monitor I/O latency in real time. It shows disk latency in the same way that ping shows network latency.

slide-47
SLIDE 47

nci.org.au

Lab 1 – ioping Tasks – 1/3

a) Log into Raijin and, in your /short/c04 area, create a directory called Lab1, then check out the following Git repos:
   a.1) FIO: https://github.com/axboe/fio.git
   a.2) ioping: https://github.com/koct9i/ioping.git
b) Build both FIO and ioping in your raijin:/short/c37 area

slide-48
SLIDE 48

nci.org.au

Lab 1 – ioping Tasks – 2/3

c) Raijin has three filesystems accessible to end-users: /home, /jobfs and /short

c.1) Measuring latency for single-threaded sequential workloads
Using ioping, measure the IO latency for /short. Construct PBS jobs (or use the express queue) to use 1 CPU, run the ioping executable, and record the latency (in milliseconds), IOPS and B/W for block sizes of 4 KB, 128 KB and 1 MB and working sets of 10 MB, 100 MB and 1024 MB. An example run:

% /short/z00/jxa900/ioping -c 20 -s 8KB -C -S 1024MB /home/900/jxa900
# run ioping for 20 requests, 8 KB block size and a working set of 1 GB on my home directory
...
8 KiB from /home/900/jxa900 (lustre 10.9.103.3@o2ib3:10.9.103.4@o2ib3:/homsys): request=20 time=29 us

--- /home/900/jxa900 (lustre 10.9.103.3@o2ib3:10.9.103.4@o2ib3:/homsys) ioping statistics ---
20 requests completed in 2.79 ms, 160 KiB read, 7.16 k iops, 55.9 MiB/s
min/avg/max/mdev = 28 us / 139 us / 289 us / 103 us

From the above experiment, the corresponding data values are:
  Working Set (MB), Block Size (KB), Time Taken (ms), Data Read (KB), IOPS, B/W (MB/sec)
  1024, 8, 2.79, 160, 7160, 55.9

slide-49
SLIDE 49

nci.org.au

Lab 1 – ioping Tasks – 3/3

From the previous experiment, note the results down in a table. Can you explain the observed trends for IOPS and B/W for the three working-set sizes as the request block size changes? Extra credit: run this lab on a Raijin compute node's /jobfs and compare.

  Working Set (MB)   Block Size (KB)   Time Taken (ms)   Data Read (KB)   IOPS   B/W (MB/sec)
  10                 4
  10                 128
  10                 1024
  100                4
  100                128
  100                1024
  1024               4
  1024               128
  1024               1024

slide-50
SLIDE 50

nci.org.au

Lab 1 – FIO Tasks – 1/2

Using FIO, measure read and write IOPS for two, four and eight threads of IO running on /short, for a block size of 1 MB and a file size of 1024 MB. These can be set when calling FIO using the --bs and --size flags. For a sequential write workload add --readwrite=write. Fill in the table below.

SEQUENTIAL WRITE IO PERFORMANCE:

  No. of Threads   Working Set (MB)   Block Size (KB)   Time Taken (ms)   Data Written (MB)   IOPS   B/W (MB/sec)
  2                1024               1024
  4                1024               1024
  8                1024               1024

slide-51
SLIDE 51

nci.org.au

Lab 1 – FIO Tasks – 2/2

  • Sample run for random IO:

# Random workload – set using --readwrite=randrw; the ratio of reads and writes is set using --rwmixread=50 (i.e. 50% reads, 50% writes)

% /short/z00/jxa900/fio-src/fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --bs=4k --iodepth=8 --size=1024MB --readwrite=randrw --rwmixread=50 --thread --numjobs=2 --name=test --filename=/short/jxa900/test.out
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
...
fio-2.2.6-15-g3765
Starting 2 threads
Jobs: 2 (f=2): [m(2)] [100.0% done] [47712KB/0KB/0KB /s] [11.1K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=2): err= 0: pid=7014: Sat Apr 25 18:46:09 2015
  mixed: io=2048.0MB, bw=43254KB/s, iops=10813, runt= 48485msec
  cpu : usr=1.42%, sys=20.51%, ctx=243452, majf=0, minf=5
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued : total=r=524288/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
  latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
  MIXED: io=2048.0MB, aggrb=43253KB/s, minb=43253KB/s, maxb=43253KB/s, mint=48485msec, maxt=48485msec

In the previous table, note the data:
  Num Threads, Working Set (MB), Block Size (KB), Time Taken (ms), Data Read/Written (MB), IOPS, B/W (MB/sec)
  2, 1024, 4, 48485, 2048, 10813, 42.2 (= 43253/1024)

slide-52
SLIDE 52

nci.org.au

HPC IO

slide-53
SLIDE 53

nci.org.au

HPC IO Software Stack

The software used to provide data model support and to transform I/O to better perform on today's I/O systems is often referred to as the I/O stack.

  – Data model libraries map application abstractions onto storage abstractions and provide data portability: HDF5, Parallel netCDF, ADIOS
  – I/O middleware organizes accesses from many processes, especially those using collective I/O: MPI-IO, GLEAN, PLFS
  – I/O forwarding transforms I/O from many clients into fewer, larger requests, reduces lock contention, and bridges between the HPC system and external storage: IBM ciod, IOFSL, Cray DVS
  – The parallel file system maintains the logical file model and provides efficient access to data: PVFS, PanFS, GPFS, Lustre

[Stack: Application → Data Model Support → Transformations → Parallel File System → I/O Hardware]

http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf

slide-54
SLIDE 54

nci.org.au

Lustre Components

  • All of Raijin's filesystems are Lustre, which is a distributed filesystem
  • The primary components are the MDSes (metadata servers) and the OSSes (object storage servers). The OSSes hold the data objects, whereas the MDSes map these objects into files

slide-55
SLIDE 55

nci.org.au

NCI’s integrated high-performance environment

[Diagram: NCI's integrated high-performance environment]
  • /g/data persistent global parallel filesystems on a 56 Gb FDR IB fabric: /g/data1 ~6.3 PB, /g/data2 ~3.1 PB
  • Raijin high-speed filesystems: /short 7.6 PB; /home, /system, /images, /apps
  • Massdata tape archive: 1.0 PB cache, 12.3 PB tape
  • Raijin HPC compute, Raijin login + data movers, VMware cloud and NCI data movers, connected by the Raijin 56 Gb FDR IB fabric, 10 GigE, a link to the Huxley DC, and the Internet

slide-56
SLIDE 56

nci.org.au

Lustre’s File System Architecture

slide-57
SLIDE 57

nci.org.au

Parts of the Lustre System

  • The client (you) must talk to both the MDS and the OSS servers in order to use the Lustre system.
  • File I/O goes to one or more OSSes. Opening files, listing directories, etc. go to the MDS.
  • The front end to the Lustre file system is a Logical Object Volume (LOV), which simply appears like any other large volume that would be mounted on a node.

http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf

slide-58
SLIDE 58

nci.org.au

[Diagram (detail): upper layers of the Linux Storage Stack Diagram, version 4.10 (2017-03-10) – applications, system calls (open(2), read(2), write(2), stat(2), chmod(2), mmap, direct I/O with O_DIRECT), the VFS and page cache, and the filesystem families: block-based (ext2/3/4, btrfs, XFS, iso9660, ...), network (NFS, coda, smbfs, gfs, ocfs, ceph, ...), pseudo (proc, sysfs, futexfs, usbfs), special-purpose (tmpfs, ramfs, devtmpfs, pipefs) and stackable (ecryptfs, overlayfs, unionfs, FUSE/userspace filesystems such as sshfs), sitting above stackable block devices (LVM, mdraid, drbd, dm-crypt, dm-mirror, dm-thin, dm-cache, bcache) and the LIO target stack.]

Where do you think the LOV would reside?

slide-59
SLIDE 59

nci.org.au

Lustre File System and Striping

  • Striping allows parts of files to be stored on different OSTs, in a RAID-0 pattern.
    – The number of objects is called the stripe_count.
    – Objects contain "chunks" of data that can be as large as stripe_size.

http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf

slide-60
SLIDE 60

nci.org.au

How does striping help?

  • Due to striping, the Lustre file system scales with the number of OSSes available
  • The capacity of a Lustre file system equals the sum of the capacities of the storage targets
    – Benefit #1: the maximum file size is not limited by the size of a single target.
    – Benefit #2: the I/O rate to a file is the aggregate of the I/O rates to the objects (see the worked example below).
  • Raijin provides 6 MDSes and 50 OSSes, capable of 150 GB/sec, but this speed is shared by all users of the system
  • Metadata access can be a bottleneck, so the MDS needs to have especially good performance (e.g., solid-state disks on some systems)
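A rough worked example of Benefit #2 (the per-OST rate here is an assumption for illustration, not an NCI figure): if a single OST sustains about 500 MB/s, a file striped over 4 OSTs can in principle be read or written at up to 4 x 500 MB/s = 2 GB/s, and its size is limited only by the combined free space on those 4 targets rather than by any single one.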

slide-61
SLIDE 61

nci.org.au

Striping data across disks

Parallel FS

  • A single client can make use of multiple disk systems simultaneously
  • "Stripe" the file across many drives
  • One drive can be finding the next block while another is sending the current block

http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf

slide-62
SLIDE 62

nci.org.au

Lustre on Raijin 1/2

  • Lustre is a scalable, POSIX-compliant parallel file system designed for large, distributed-memory systems, such as Raijin at NCI
  • It uses a server-client model with separate servers for file metadata and file content

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1

slide-63
SLIDE 63

nci.org.au

Lustre on Raijin 2/2

  • For example, on Raijin, the /short and /g/data{1,2} file systems each have a single metadata server (which can be a bottleneck when working with thousands of files) and 720, 520 and 240 Object Storage Targets (OSTs), respectively, that store the contents of files
  • Although Lustre is designed to correctly handle any POSIX-compliant I/O pattern, in practice it performs much better when the I/O accesses are aligned to Lustre's fundamental unit of storage, which is called a stripe and has a default size (on NCI systems) of 1 MB
  • Striping is a method of dividing up a shared file across many OSTs. Each stripe is stored on a different OST, and the assignment of stripes to OSTs is round-robin
  • Striping increases the available b/w by using several OSTs in parallel

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1

slide-64
SLIDE 64

nci.org.au

Invoking Striping

  • Among the various lfs commands are lfs getstripe and lfs setstripe.
  • The lfs setstripe command takes four arguments:

    lfs setstripe <file|dir> -s <bytes/OST> -o <start OST> -c <#OSTs>

    1. The file or directory for which to set the stripe.
    2. The number of bytes on each OST, with k, m, or g for KB, MB or GB.
    3. The OST index of the first stripe (-1 for the filesystem default).
    4. The number of OSTs to stripe over.
  • So to stripe across two OSTs, you would call:

    lfs setstripe bigfile -s 4m -o -1 -c 2

http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf

slide-65
SLIDE 65

nci.org.au

Invoking Striping

  • Among the various lfs commands are lfs getstripe and lfs setstripe.
  • The lfs setstripe command takes four arguments:

    lfs setstripe <file|dir> -s <bytes/OST> -o <start OST> -c <#OSTs>

    1. The file or directory for which to set the stripe.
    2. The number of bytes on each OST, with k, m, or g for KB, MB or GB.
    3. The OST index of the first stripe (-1 for the filesystem default).
    4. The number of OSTs to stripe over.
  • So to stripe across two OSTs, you would call:

    lfs setstripe bigfile -s 4m -o -1 -c 2

Lab – perform striping on Raijin:/short and measure IOPS and B/W (a hedged MPI-IO sketch of setting striping programmatically follows below)
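Besides lfs, striping can also be requested programmatically. A hedged sketch using the ROMIO hints striping_factor and striping_unit, which ROMIO-based MPI-IO implementations map onto Lustre striping at file-creation time (values and file name are illustrative; support depends on the MPI library and file system):

#include <mpi.h>
#include <stdio.h>

/* Request Lustre striping through MPI-IO hints at file creation, then read
 * back the values the implementation actually applied. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "4");        /* stripe over 4 OSTs */
    MPI_Info_set(info, "striping_unit", "4194304");    /* 4 MB stripes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "striped.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Query the hints in effect on the open file. */
    MPI_Info used;
    MPI_File_get_info(fh, &used);
    char value[MPI_MAX_INFO_VAL + 1];
    int flag;
    MPI_Info_get(used, "striping_factor", MPI_MAX_INFO_VAL, value, &flag);
    if (rank == 0 && flag) printf("striping_factor = %s\n", value);

    MPI_Info_free(&used);
    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}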

slide-66
SLIDE 66

nci.org.au

End-to-End View

[Stack diagram: Application → High-level library (HDF5, NetCDF, ADIOS) → I/O middleware (MPI-IO) → Parallel FS (GPFS, PVFS, ...)]

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-67
SLIDE 67

nci.org.au

Abstraction Layers

  • High-level libraries can simplify the programmer's task
    – Express IO in terms of the data structures of the code, not bytes and blocks
  • I/O middleware can coordinate accesses and improve performance
    – Data sieving
    – Two-phase I/O

[Stack diagram: Application → High-level library (HDF5, NetCDF, ADIOS) → I/O middleware (MPI-IO) → Parallel FS (GPFS, PVFS, ...)]

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf

slide-68
SLIDE 68

nci.org.au

Data Sieving

  • Combine many non-contiguous IO requests into fewer, bigger IO requests
  • "Sieve" the unwanted data out
  • Reduces IOPS and makes use of high bandwidth for sequential IO (see the sketch below)

Jonathan Dursi
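A hedged MPI-IO sketch of the kind of access pattern that benefits from data sieving: a noncontiguous file view serviced by one independent read call, which a ROMIO-based library may implement by reading a large contiguous region and sieving out the unwanted bytes (the ind_rd_buffer_size hint controls the sieve buffer; file name and sizes are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* A noncontiguous independent read: the process wants every 4th block of
 * 256 doubles from a shared file. Rather than issuing many small reads,
 * the MPI-IO layer can service this with data sieving. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int blocklen = 256, nblocks = 1024, stride = 4 * 256;
    MPI_Datatype filetype;
    MPI_Type_vector(nblocks, blocklen, stride, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* The file view describes the noncontiguous layout once ... */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* ... so a single independent read call covers all the pieces. */
    double *buf = malloc((size_t)nblocks * blocklen * sizeof(double));
    MPI_File_read(fh, buf, nblocks * blocklen, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}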

slide-69
SLIDE 69

nci.org.au

Two-Phase IO

  • Collect requests into larger chunks
  • Have individual nodes read big blocks
  • Then use network communications to exchange the pieces
  • Fewer IOPS, faster IO
  • Network communication is usually faster (see the sketch below)

Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
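A hedged sketch of the classic case where two-phase collective I/O helps: each rank owns a block of a global 2D array and all ranks write it with one collective call, letting the library exchange pieces over the network and issue large contiguous writes (array sizes and file name are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Each rank owns one row-block of a global 2D array; MPI_File_write_all
 * lets the library collect the pieces into large chunks (phase 1: network
 * exchange) before writing them (phase 2: big contiguous I/O). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int NX = 1024, NY = 1024;                 /* local block size */
    int gsizes[2] = {NX * nprocs, NY};              /* global array, ranks stacked along x */
    int lsizes[2] = {NX, NY};
    int starts[2] = {rank * NX, 0};

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    double *local = malloc((size_t)NX * NY * sizeof(double));
    for (int i = 0; i < NX * NY; i++) local[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array2d.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, NX * NY, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}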

slide-70
SLIDE 70

nci.org.au

HDF5 – Beyond MPI-IO

slide-71
SLIDE 71

www.hdfgroup.org

OVERVIEW OF PARALLEL HDF5 DESIGN

March 4, 2015 HPC Oil & Gas Workshop 71

slide-72
SLIDE 72

www.hdfgroup.org

Parallel HDF5 Requirements

  • Parallel HDF5 should allow multiple processes to perform I/O to an HDF5 file at the same time
    • Single file image for all processes
    • Compare with a one-file-per-process design:
      • Expensive post-processing
      • Not usable by a different number of processes
      • Too many files produced for the file system
  • Parallel HDF5 should use a standard parallel I/O interface
    • Must be portable to different platforms

March 4, 2015 HPC Oil & Gas Workshop 72

slide-73
SLIDE 73

www.hdfgroup.org

Design requirements, cont

  • Support Message Passing Interface (MPI) programming
  • Parallel HDF5 files are compatible with serial HDF5 files
  • Shareable between different serial or parallel platforms

March 4, 2015 HPC Oil & Gas Workshop 73

slide-74
SLIDE 74

www.hdfgroup.org

Design Dependencies

  • MPI with MPI-IO
    • MPICH, OpenMPI
    • Vendor's MPI-IO
  • Parallel file system
    • IBM GPFS
    • Lustre
    • PVFS

March 4, 2015 HPC Oil & Gas Workshop 74

slide-75
SLIDE 75

www.hdfgroup.org

PHDF5 implementation layers

[Diagram: HDF5 application running on compute nodes → HDF5 library → MPI library → switch network + I/O servers → HDF5 file on a parallel file system (disk architecture and layout of data on disk)]

March 4, 2015 HPC Oil & Gas Workshop 75

slide-76
SLIDE 76

www.hdfgroup.org

MPI-IO VS. HDF5

March 4, 2015 HPC Oil & Gas Workshop 76

slide-77
SLIDE 77

www.hdfgroup.org

MPI-IO

  • MPI-IO is an input/output API
  • It treats the data file as a "linear byte stream", and each MPI application needs to provide its own file and data representations to interpret those bytes

March 4, 2015 HPC Oil & Gas Workshop 77

slide-78
SLIDE 78

www.hdfgroup.org

MPI-IO

  • All data stored is machine-dependent, except for the "external32" representation
    • External32 is defined as big-endian
    • Little-endian machines have to do the data conversion on both read and write operations
    • 64-bit sized data types may lose information

March 4, 2015 HPC Oil & Gas Workshop 78

slide-79
SLIDE 79

www.hdfgroup.org

MPI-IO vs. HDF5

  • HDF5 is data management software
  • It stores data and metadata according to the HDF5 data format definition
  • An HDF5 file is self-describing
    • Each machine can store the data in its own native representation for efficient I/O without loss of data precision
    • Any necessary data representation conversion is done by the HDF5 library automatically (a minimal parallel HDF5 sketch follows below)

March 4, 2015 HPC Oil & Gas Workshop 79
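To make the contrast concrete, here is a minimal parallel HDF5 sketch in C, assuming HDF5 built with MPI support: every rank writes its own hyperslab of one dataset in a single self-describing file, with the MPI-IO driver selected via H5Pset_fapl_mpio and a collective transfer via H5Pset_dxpl_mpio (dataset name and sizes are illustrative):

#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t n = 1 << 20;                          /* doubles per rank */
    double *data = malloc(n * sizeof(double));
    for (hsize_t i = 0; i < n; i++) data[i] = rank;

    /* Open the file with the MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One global 1-D dataset, nprocs * n doubles long. */
    hsize_t gdim = n * nprocs;
    hid_t filespace = H5Screate_simple(1, &gdim, NULL);
    hid_t dset = H5Dcreate(file, "values", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's hyperslab and write it collectively. */
    hsize_t start = (hsize_t)rank * n;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &n, NULL);
    hid_t memspace = H5Screate_simple(1, &n, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);

    H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
    H5Sclose(filespace); H5Pclose(fapl); H5Fclose(file);
    free(data);
    MPI_Finalize();
    return 0;
}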

slide-80
SLIDE 80

nci.org.au

LAB 2

slide-81
SLIDE 81

nci.org.au

Lab 2 – Lustre Striping Parameters on Raijin

Lustre allows you to modify three striping parameters for a shared file:

  • the stripe count controls how many OSTs the file is striped over (for example, the stripe count is 4 for the file shown above);
  • the stripe size controls the number of bytes in each stripe; and
  • the start index chooses where in the list of OSTs to start the round-robin assignment (the default value -1 allows the system to choose the offset in order to load-balance the file system).

The default parameters on Raijin:/short are [count=2, size=1MB, index=-1], but these can be changed and viewed on a per-file or per-directory basis using the commands:

% lfs setstripe [file,dir] -c [count] -s [size] -i [index]
% lfs getstripe [file,dir]

A file automatically inherits the striping parameters of the directory it is created in, so changing the parameters of a directory is a convenient way to set the parameters for a collection of files you are about to create. For instance, if your application creates output files in a subdirectory called output/, you can set the striping parameters on that directory once before your application runs, and all of your output files will inherit those parameters.

slide-82
SLIDE 82

nci.org.au

Lab 2 – Tasks – 1/2

a) Create a directory under Raijin:/short and run lfs getstripe <myDir>. Explain the output of the command; use the man pages if required.
b) Re-run the Lab 1 exercise with FIO using the following stripe sizes and counts for a 1 GB file, for a sequential write workload:
   Stripe size: 1 MB, 4 MB
   Stripe count: -1, 2, 4

slide-83
SLIDE 83

nci.org.au

Lab 2 – Tasks – 2/2

  Stripe Size (MB)   Stripe Count   No. of Threads   Working Set (MB)   Block Size (KB)   Time Taken (ms)   Data Written (MB)   IOPS   B/W (MB/sec)
  1                  -1             4                1024               1024
  1                  2              4                1024               1024
  1                  4              4                1024               1024
  4                  -1             4                1024               1024
  4                  2              4                1024               1024
  4                  4              4                1024               1024

Bonus credit: using your ANU UniID and password, log onto the NeCTAR cloud and bring up a single-core VM using a CentOS or Ubuntu image on any NeCTAR node. Perform the same tests as in Lab 1.
URL for creating VMs: https://dashboard.rc.nectar.org.au/
Setting up SSH keys: http://darlinglab.org/tutorials/instance_startup/

slide-84
SLIDE 84

nci.org.au

CLOSING SLIDE

slide-85
SLIDE 85

nci.org.au

Homebrew HPC Deep-Packet Capture?

https://www.hdfgroup.org/2017/08/handling-ingesting-data-streams-500k-messs/