Optimized Scatter/Gather for Parallel Storage
Latchesar Ionkov, Carlos Maltzahn, Michael Lang
PDSW-DISCS 2017
LA-UR-17-2163
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
Los Alamos National Laboratory
HPC Storage: Stuck in the Past
Replacing POSIX is hard
- Great interface
- Easy to understand and use
- Easy to implement almost correctly
- Not scalable for shared use
- A lot of unsettled corner cases
- Made for programmers
- Scientists don’t care about files
  - they have datasets
  - they have other things to worry about
  - at best, they know how their data is laid out in memory
Middleware
- Different (better?) user interface
  - HDF5
  - MPI I/O
  - ADIOS
  - ArrayQL
- Better performance
  - MPI I/O
  - PLFS
  - DeltaFS
  - GIGA+
- They all have to deal with POSIX idiosyncrasies
Complete Systems
- Huge effort
- Feature creep — even harder to finish
- Interoperability?
Interfaces are important
- Simple
- Not too extensible, not too many knobs
- Too much freedom is bad; the designer should make the right choices
- ASGARD tries to be the best interface for something specific
  - right level of description of data
  - for distributed environments
  - so data can be efficiently gathered from pieces scattered across many nodes
  - language and library independent
Fragments
- Describe part of the dataset
- Contiguous
- Can be materialized in memory, or stored on disk
Blocks
- Fragments consist of blocks
- Describe contiguous region of the fragment
- Can be connected to Blocks in other fragments
- Each Block has:
  - offset
  - size
  - list of Blocks
- Three types of Blocks: SBlock, TBlock, ABlock
SBlock
- “Simple” Block
- Properties
  - offset
  - size
  - list of Blocks (connections, same type and size)
- Examples:
  - double -> SBlock of size 8
  - uint32_t -> SBlock of size 4
TBlock
- “sTruct” Block
- Groups other Blocks (of different sizes)
- Properties
  - offset
  - size
  - list of Blocks (fields)
  - list of Blocks (connections, same type)
- Field Block offsets are relative to the start of the TBlock
- Can have holes
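The SBlock and TBlock properties above can be sketched as plain data structures. This is a minimal Python illustration, not the ASGARD engine itself; the class names mirror the slides, but the `holes` helper, the `dests` attribute name, and the packed example layout are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SBlock:
    """Simple Block: a contiguous run of bytes, e.g. one scalar."""
    offset: int          # relative to the enclosing Block
    size: int            # bytes (8 for a double, 4 for a uint32_t)
    dests: List["SBlock"] = field(default_factory=list)  # connections


@dataclass
class TBlock:
    """sTruct Block: groups field Blocks of possibly different sizes."""
    offset: int
    size: int
    fields: List[SBlock] = field(default_factory=list)
    dests: List["TBlock"] = field(default_factory=list)  # connections


def holes(t: TBlock) -> List[Tuple[int, int]]:
    """Return (offset, size) of byte ranges in t not covered by any field."""
    gaps, pos = [], 0
    for f in sorted(t.fields, key=lambda f: f.offset):
        if f.offset > pos:
            gaps.append((pos, f.offset - pos))
        pos = max(pos, f.offset + f.size)
    if pos < t.size:
        gaps.append((pos, t.size - pos))
    return gaps


# Hypothetical packed struct { a float64; b float32; c float64 }.
full = TBlock(0, 20, [SBlock(0, 8), SBlock(8, 4), SBlock(12, 8)])
# The same 20 bytes described with only fields a and c: b becomes a hole.
partial = TBlock(0, 20, [SBlock(0, 8), SBlock(12, 8)])
```

In this sketch a hole is simply a byte range inside the TBlock that no field Block describes, which is how a fragment could select a subset of a struct's fields without changing the offsets of the remaining ones.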
ABlock
- “Array” Block
- Groups Blocks of the same type and size
- Properties:
  - offset
  - dimension sizes
  - element order (row-major, column-major, etc.)
  - element Block
  - list of Destinations (connections)
- Destination
  - Block
  - $(a_i, b_i, c_i, d_i, idx_i)$ for each dimension $i$, with
    $idx_i = \frac{a_i x_i + b_i}{c_i x_i + d_i}$
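As a worked illustration of the per-dimension Destination parameters, the sketch below reads the index formula as a ratio of two linear terms, idx_i = (a_i x_i + b_i) / (c_i x_i + d_i), and converts an index into a byte offset for a row-major array. The function names, the use of integer division, and returning `None` for indices that fall outside the destination are assumptions for the example; the slides do not specify them.

```python
from typing import Optional, Sequence, Tuple

# One (a, b, c, d) tuple per dimension; names follow the slide's notation.
DimMap = Tuple[int, int, int, int]


def map_index(x: Sequence[int], maps: Sequence[DimMap],
              dest_dims: Sequence[int]) -> Optional[Tuple[int, ...]]:
    """Map source index x to a destination index, one dimension at a time.

    Returns None if the result falls outside dest_dims (assumed behavior).
    """
    out = []
    for xi, (a, b, c, d), n in zip(x, maps, dest_dims):
        den = c * xi + d
        if den == 0:
            return None
        yi = (a * xi + b) // den   # integer division is an assumption
        if not 0 <= yi < n:
            return None
        out.append(yi)
    return tuple(out)


def row_major_offset(idx: Sequence[int], dims: Sequence[int],
                     elem_size: int) -> int:
    """Byte offset of element idx in a row-major array with the given dims."""
    off = 0
    for i, n in zip(idx, dims):
        off = off * n + i
    return off * elem_size


# Shift by 25 in both dimensions: a=1, b=25, c=0, d=1 gives idx = x + 25.
shift = [(1, 25, 0, 1), (1, 25, 0, 1)]
```

With these parameters a plain shift, a stride (a > 1), or a block-wise division (c = 0, d > 1) can all be written in the same four-number form, which may be why a ratio of linear terms was chosen.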
Fragment
- Fragment
  - Collection of Blocks
  - Top-level Blocks
- Transformation (src, dest)
  - For each top-level Block in src
    - SBlock: copy to each destination Block ∈ dest
    - TBlock: recursively run for each field (keep offsets)
    - ABlock: for each element with index [x1, x2, …, xn]
      - calculate offset in src
      - for each destination Block ∈ dest
        - calculate index [y1, y2, …, yn] in dest
        - calculate offset
        - recursively run transformation for the element Block
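The transformation steps above can be sketched as a small recursive function. This is a simplified illustration under stated assumptions, not the ASGARD engine: arrays have a single dimension, the index mapping is a plain shift, and destinations are encoded as (fragment name, offset) pairs instead of references to Blocks. The names `ADest`, `transform`, `bufs`, and `dbase` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class SBlock:
    offset: int                      # relative to the enclosing Block
    size: int
    # Assumed encoding: (destination fragment name, offset inside that
    # fragment's current element) instead of a reference to a Block.
    dests: List[Tuple[str, int]] = field(default_factory=list)


@dataclass
class TBlock:
    offset: int
    fields: List["Block"] = field(default_factory=list)


@dataclass
class ADest:
    name: str        # destination fragment
    offset: int      # start of the destination array in its buffer
    count: int       # number of destination elements
    elem_size: int
    shift: int       # y = x + shift (a=1, b=shift, c=0, d=1)


@dataclass
class ABlock:
    offset: int
    count: int       # one dimension only, for brevity
    elem_size: int
    elem: "Block"
    dests: List[ADest] = field(default_factory=list)


Block = Union[SBlock, TBlock, ABlock]


def transform(b: Block, src: bytes, sbase: int,
              bufs: Dict[str, bytearray], dbase: Dict[str, int]) -> None:
    """Copy the bytes described by b from src into the destination buffers."""
    if isinstance(b, SBlock):
        start = sbase + b.offset
        data = src[start:start + b.size]
        for name, doff in b.dests:
            if name in dbase:        # element exists in this destination
                pos = dbase[name] + doff
                bufs[name][pos:pos + b.size] = data
    elif isinstance(b, TBlock):
        for f in b.fields:           # field offsets stay relative to b
            transform(f, src, sbase + b.offset, bufs, dbase)
    else:                            # ABlock
        for x in range(b.count):
            ebase = sbase + b.offset + x * b.elem_size
            bases = {}
            for d in b.dests:
                y = x + d.shift
                if 0 <= y < d.count:
                    bases[d.name] = d.offset + y * d.elem_size
            transform(b.elem, src, ebase, bufs, bases)


# Four 4-byte elements {a: 2 bytes, b: 2 bytes}; "viz" stores {b, a}.
elem = TBlock(0, [SBlock(0, 2, [("viz", 2)]), SBlock(2, 2, [("viz", 0)])])
arr = ABlock(0, 4, 4, elem, [ADest("viz", 0, 4, 4, 0)])
src = b"a0b0a1b1a2b2a3b3"
bufs = {"viz": bytearray(16)}
transform(arr, src, 0, bufs, {})
```

After the call, each element in the hypothetical `viz` buffer holds field b followed by field a, so the same rule set performs both the gather and the field reordering in one pass over the source.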
Transformation Rules
fragment dataset {
    var p struct { a, b, c float32 }
}
fragment default {
    var p = p
}
fragment viz {
    var pa { a } = p
    var pba { b, a } = p
}
[Figure: Block graphs for the default and viz fragments. TBlock p has SBlock fields a, b, and c at offsets 0, 4, and 8; dest connections link these fields with the fields of TBlocks pa and pba.]
Fragment Sources
[Figure: fragments A through H and J, with labels (A:0.25), (A:0.25 J:0.25), (B:0.3), (B:0.2 J:0.2), (D:0.4), (D:0.1 J:0.1), (E:0.2), (E:0.3 J:0.3), and (H:1.0).]
Transformation Rules
[Figure: Node T runs four local xforms, one for each of Nodes A, B, C, and D; each of those nodes runs a remote (rmt) xform.]
[Figure: Block graphs for fragments ds, f1, f2, and f3. In ds, a 22-byte TBlock has SBlock fields at offsets 0, 8, 12, and 20 with sizes 8, 4, 8, and 2, and is the element of a (100, 100) ABlock of 220000 bytes. Dest connections carry per-dimension index expressions such as { i } | { j }, { i-25 } | { j-25 }, and { i+25 } | { j+25 }.]
fragment ds {
    type P struct {
        a float64
        b float32
        c float64
        d int16
    }
    var data [100, 100] P
}
fragment f1 { var d1 = data }
fragment f2 { var d2 { a, c } = data }
fragment f3 { var d3[i, j] {d, c} = data[i-25, j-25] }
Optimizations
- a. Merging neighboring fields
- b. Replacing a TBlock with an SBlock
[Figure: trees of TBlocks and SBlocks before and after optimizations (a) and (b), and an ABlock of TBlocks shown alongside three ABlocks.]
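Optimizations (a) and (b) can be illustrated with a short sketch. The slides do not state the exact merge conditions, so this assumes that two field copies may be merged when they are adjacent in both the source and the destination, and that a TBlock may be replaced by a single SBlock when the merged result is one copy covering the whole struct. The `Copy` tuple and both function names are hypothetical.

```python
from typing import List, Tuple

Copy = Tuple[int, int, int]   # (source offset, destination offset, size)


def merge_neighbors(copies: List[Copy]) -> List[Copy]:
    """Merge copies that are adjacent in both source and destination."""
    merged: List[Copy] = []
    for s, d, n in sorted(copies):
        if merged:
            ps, pd, pn = merged[-1]
            if ps + pn == s and pd + pn == d:
                merged[-1] = (ps, pd, pn + n)
                continue
        merged.append((s, d, n))
    return merged


def collapses_to_sblock(copies: List[Copy], struct_size: int) -> bool:
    """True if the merged fields form one copy spanning the whole struct."""
    m = merge_neighbors(copies)
    return len(m) == 1 and m[0][2] == struct_size


# Field copies for a hypothetical 22-byte struct with fields at 0, 8, 12, 20.
identity = [(0, 0, 8), (8, 8, 4), (12, 12, 8), (20, 20, 2)]
subset = [(0, 0, 8), (12, 8, 8)]          # first and third fields only
```

Here `identity` merges into a single 22-byte copy, so the struct could be treated as one SBlock, while `subset` stays as two copies because the skipped field leaves a gap in the source.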
Ceph Integration
- RADOS objects: custom object class extension
- Dataset object
  - metadata: dataset + stripe definitions
  - no data
- Stripe object
  - partial read/write using transformation rules
  - write triggers updates to secondary replicas
- Client side
  - access unit is a fragment
  - server sends back a list of objects and transformation rules
  - client executes local transformation rules and sends remote transformation rules (+ data) to the OSD
Results: MPI Tile I/O
[Figure: four panels comparing ASGARD, collective MPI I/O, and non-collective MPI I/O: write and read bandwidth (MB/s) vs. tile size, and write and read bandwidth (MB/s) vs. number of ranks.]
Results: HPIO Read
[Figure: read bandwidth (MB/s) vs. region count for ASGARD, collective MPI I/O, and non-collective MPI I/O, in four panels: contiguous memory / contiguous storage, contiguous memory / non-contiguous storage, non-contiguous memory / contiguous storage, and non-contiguous memory / non-contiguous storage.]
Ceph Bandwidth
[Figure: three panels of bandwidth (MB/s) over time (s), one each for ASGARD, collective MPI I/O, and non-collective MPI I/O.]
Ceph Operations
[Figure: three panels of operations per second over time (s), one each for ASGARD, collective MPI I/O, and non-collective MPI I/O.]
Conclusions
- ASGARD defines a language- and library-independent data description
- Compact transformation rules
- Small transformation engine (3K LOC) with implementations in Go and C
- Easy to integrate into storage systems and libraries
- Questions:
  - is it the right level of data description?
  - does it make sense to push for general file systems?
  - what else did we miss?
  - do we need byte order (LSB, MSB) and/or primary type (IEEE 754, integer)?