
Optimized Scatter/Gather for Parallel Storage PDSW-DISCS 2017



  1. Optimized Scatter/Gather for Parallel Storage
     PDSW-DISCS 2017
     Latchesar Ionkov, Carlos Maltzahn, Michael Lang
     Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
     LA-UR-17-2163

  2. HPC Storage: Stuck in the Past
     Los Alamos National Laboratory

  3. Replacing POSIX is hard
     • Great interface
     • Easy to understand and use
     • Easy to implement almost correctly
     • Not scalable for shared use
     • A lot of unsettled corner cases
     • Made for programmers
     • Scientists don’t care about files
       • they have datasets
       • they have other things to worry about
       • best case: know how data is laid out in memory

  4. Middleware
     • Different (better?) user interface
       • HDF5
       • MPI I/O
       • ADIOS
       • ArrayQL
     • Better performance
       • MPI I/O
       • PLFS
       • DeltaFS
       • GIGA+
     • They all have to deal with POSIX idiosyncrasies

  5. Complete Systems
     • Huge effort
     • Feature creep: even harder to finish
     • Interoperability?

  6. Interfaces are important
     • Simple
     • Not too extensible, not too many knobs
     • Too much freedom is bad; the designer should make the right choices
     • ASGARD tries to be the best interface for something specific
       • right level of description of data
       • for a distributed environment
       • so data can be efficiently gathered from pieces scattered across many nodes
       • language and library independent

  7. Fragments
     • Describe part of the dataset
     • Contiguous
     • Can be materialized in memory, or stored on disk

  8. Blocks
     • Fragments consist of Blocks
     • Describe a contiguous region of the fragment
     • Can be connected to Blocks in other fragments
     • Each Block has:
       • offset
       • size
       • list of Blocks
     • Three types of Blocks

  9. SBlock
     • “Simple” Block
     • Properties
       • offset
       • size
       • list of Blocks (connections, same type and size)
     • Examples:
       • double -> SBlock of size 8
       • uint32_t -> SBlock of size 4

  10. TBlock
      • “sTruct” Block
      • Groups other Blocks (of different sizes)
      • Properties
        • offset
        • size
        • list of Blocks (fields)
        • list of Blocks (connections, same type)
      • Offsets of the field Blocks are relative to the start of the TBlock
      • Can have holes
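The SBlock and TBlock properties listed on slides 8 through 10 can be sketched as plain data records. This is a minimal illustration in Python, not the ASGARD implementation: the class layout, the `connections` attribute name, and the `holes` helper are assumptions introduced here to show what "offsets relative to the start of the TBlock" and "can have holes" mean in practice.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class SBlock:
    """Simple block: `size` bytes at `offset` (e.g. a double is size 8)."""
    offset: int
    size: int
    connections: List["SBlock"] = field(default_factory=list)


@dataclass
class TBlock:
    """Struct block: groups field blocks; field offsets are relative to it."""
    offset: int
    size: int
    fields: List[Union[SBlock, "TBlock"]] = field(default_factory=list)
    connections: List["TBlock"] = field(default_factory=list)


def holes(t: TBlock) -> List[Tuple[int, int]]:
    """Return (offset, length) byte ranges of `t` not covered by any field."""
    gaps, pos = [], 0
    for f in sorted(t.fields, key=lambda b: b.offset):
        if f.offset > pos:
            gaps.append((pos, f.offset - pos))
        pos = max(pos, f.offset + f.size)
    if pos < t.size:
        gaps.append((pos, t.size - pos))
    return gaps


# Hypothetical 16-byte struct: a double at offset 0 and a uint32_t at offset 12.
record = TBlock(offset=0, size=16, fields=[SBlock(0, 8), SBlock(12, 4)])
print(holes(record))  # [(8, 4)]
```

In this hypothetical record, bytes 8 through 11 belong to no field, so a transformation would never read or write them.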

  11. ABlock
      • “Array” Block
      • Groups Blocks of the same type and size
      • Properties:
        • offset
        • dimension sizes
        • element order (row-major, column-major, etc.)
        • element Block
        • list of Destinations (connections)
      • Destination
        • Block
        • $y_{idx_i} = \frac{a_i x_i + b_i}{c_i x_i + d_i}$
        • $(a_i, b_i, c_i, d_i, idx_i)$ for each dimension
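The ABlock properties suggest two small calculations: turning an element index into a byte offset for a given element order, and mapping a source index to a destination index with one `(a, b, c, d, idx)` tuple per dimension. The sketch below is one possible reading, with the function names `element_offset` and `dest_index` invented for illustration. How ASGARD treats non-integer or out-of-range results is not stated on the slide; here they are assumed to mean "no destination element".

```python
from fractions import Fraction


def element_offset(index, dims, elem_size, row_major=True):
    """Byte offset of the element at `index` in an array of shape `dims`."""
    axes = range(len(dims)) if row_major else reversed(range(len(dims)))
    linear = 0
    for axis in axes:
        linear = linear * dims[axis] + index[axis]
    return linear * elem_size


def dest_index(x, rules, dest_dims):
    """Map source index `x` to a destination index, or None if there is none.

    `rules` holds one (a, b, c, d, idx) tuple per source dimension i, read as
    y[idx] = (a * x[i] + b) / (c * x[i] + d).
    """
    y = [0] * len(dest_dims)
    for xi, (a, b, c, d, idx) in zip(x, rules):
        den = c * xi + d
        if den == 0:
            return None
        q = Fraction(a * xi + b, den)
        # Assumption: non-integer or out-of-range results mean "no destination".
        if q.denominator != 1 or not 0 <= q < dest_dims[idx]:
            return None
        y[idx] = int(q)
    return tuple(y)


# Shift by 25 in both dimensions into a 50 x 50 array of 10-byte elements.
shift = [(1, -25, 0, 1, 0), (1, -25, 0, 1, 1)]
y = dest_index((30, 40), shift, (50, 50))
print(y, element_offset(y, (50, 50), 10))  # (5, 15) 2650
```

Swapping the `idx` values of the two rules, as in `[(1, 0, 0, 1, 1), (1, 0, 0, 1, 0)]`, expresses a transpose with the same tuple format.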

  12. Fragment
      • Fragment
        • Collection of Blocks
        • Top-level Blocks
      • Transformation(src, dest)
        • For each top-level Block in src
          • SBlock: copy to each destination Block ∈ dest
          • TBlock: recursively run for each field (keep offsets)
          • ABlock: for each element with index $[x_1, x_2, \ldots, x_n]$
            • calculate offset in src
            • for each destination Block ∈ dest
              • calculate index $[y_1, y_2, \ldots, y_n]$ in dest
              • calculate offset
              • recursively run transformation for the element Block
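The recursive transformation on slide 12 can be sketched as a function that walks the source Blocks and emits byte copies. This is a simplified illustration, not the ASGARD engine. Assumed here: row-major arrays, at most one destination array per source ABlock, destination SBlock offsets stored relative to the enclosing array element, and source indices that map outside the destination being skipped. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Optional, Tuple


@dataclass
class SBlock:
    offset: int
    size: int
    dests: list = field(default_factory=list)   # matching SBlocks in dest


@dataclass
class TBlock:
    offset: int
    size: int
    fields: list = field(default_factory=list)  # offsets relative to TBlock


@dataclass
class ABlock:
    offset: int
    dims: Tuple[int, ...]
    elem: object                                # element Block
    elem_size: int
    dest: Optional["ABlock"] = None             # simplification: one destination
    rules: tuple = ()                           # (a, b, c, d, idx) per dimension


def linear(index, dims):
    """Row-major position of `index` in an array of shape `dims`."""
    pos = 0
    for i, n in zip(index, dims):
        pos = pos * n + i
    return pos


def map_index(x, rules, dims):
    """Destination index for source index `x`, or None if it has none."""
    y = [0] * len(dims)
    for xi, (a, b, c, d, idx) in zip(x, rules):
        num, den = a * xi + b, c * xi + d
        if den == 0 or num % den:
            return None
        y[idx] = num // den
        if not 0 <= y[idx] < dims[idx]:
            return None
    return tuple(y)


def transform(block, src_base, dst_base, ops):
    """Append (src_offset, dst_offset, size) copies for `block` to `ops`."""
    if isinstance(block, SBlock):
        for d in block.dests:
            ops.append((src_base + block.offset, dst_base + d.offset, block.size))
    elif isinstance(block, TBlock):
        for f in block.fields:
            transform(f, src_base + block.offset, dst_base, ops)
    elif isinstance(block, ABlock) and block.dest is not None:
        dest = block.dest
        for x in product(*(range(n) for n in block.dims)):
            y = map_index(x, block.rules, dest.dims)
            if y is None:
                continue
            src_elem = src_base + block.offset + linear(x, block.dims) * block.elem_size
            dst_elem = dst_base + dest.offset + linear(y, dest.dims) * dest.elem_size
            transform(block.elem, src_elem, dst_elem, ops)
    return ops


# Source: 4x4 array of {a: 2 bytes, b: 2 bytes}; destination: 2x2 array of b,
# where dest[i, j] = src[i + 1, j + 1].
dest_b = SBlock(0, 2)
dest_arr = ABlock(0, (2, 2), dest_b, 2)
src_elem = TBlock(0, 4, [SBlock(0, 2), SBlock(2, 2, [dest_b])])
src_arr = ABlock(0, (4, 4), src_elem, 4, dest_arr, [(1, -1, 0, 1, 0), (1, -1, 0, 1, 1)])

ops = transform(src_arr, 0, 0, [])
src = bytes(range(64))
dst = bytearray(8)
for s, d, n in ops:
    dst[d:d + n] = src[s:s + n]
print(ops)
```

Producing a list of copies rather than moving bytes directly keeps the rule evaluation separate from the I/O, so the same list could be applied to memory or to a file.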

  13. Transformation Rules
      fragment dataset {
          var p struct {
              a, b, c float32
          }
      }
      fragment default {
          var p = p
      }
      fragment viz {
          var pa { a } = p
          var pba { b, a } = p
      }
      [Figure: block graph for the dataset, default, and viz fragments. TBlocks and SBlocks are labeled with their offsets and linked by field and dest edges.]

  14. Fragment Sources
      [Figure: graph of fragments A through J. Nodes carry annotations such as (A:0.25), (B:0.3), (D:0.4), (E:0.2), (H:1.0) and (A:0.25 J:0.25), (B:0.2 J:0.2), (D:0.1 J:0.1), (E:0.3 J:0.3).]

  15. Transformation Rules
      [Figure: nodes A, B, C, D, and T connected by local transformations (local xform) and remote transformations (rmt xform).]

  16. fragment ds {
          type P struct {
              a float64
              b float32
              c float64
              d int16
          }
          var data [100, 100] P
      }
      fragment f1 {
          var d1 = data
      }
      fragment f2 {
          var d2 { a, c } = data
      }
      fragment f3 {
          var d3[i, j] {d, c} = data[i-25, j-25]
      }
      [Figure: block graph for fragments ds, f1, f2, and f3. ABlock labels: ds A 000000:220000 (100, 100), f1 A 000000:110000 (100, 50), f2 A 000000:80000 (50, 100), f3 A 000000:25000 (50, 50). Element TBlock labels: T 000000:22 (ds and f1), T 000000:16 (f2), T 000000:10 (f3). Destination edges carry per-dimension index expressions such as { i } | { j }, { i-25 } | { j-25 }, and { i+25 } | { j+25 }.]
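The block labels in the slide 16 figure appear to read as `offset:size`. Under that reading, and assuming a packed layout with no alignment padding, the sizes can be checked with a few lines of arithmetic. The helper names `layout` and `project` are hypothetical, and the array shapes for f2 and f3 are taken from the figure labels rather than from the fragment definitions.

```python
SIZES = {"float64": 8, "float32": 4, "int16": 2}


def layout(fields):
    """Packed layout: return ({name: (offset, size)}, total_size)."""
    offsets, pos = {}, 0
    for name, typ in fields:
        offsets[name] = (pos, SIZES[typ])
        pos += SIZES[typ]
    return offsets, pos


def project(fields, names):
    """Select `names` from `fields`, in the order given by `names`."""
    types = dict(fields)
    return [(n, types[n]) for n in names]


P = [("a", "float64"), ("b", "float32"), ("c", "float64"), ("d", "int16")]

for label, fields, dims in [
    ("ds", P, (100, 100)),
    ("f2", project(P, ["a", "c"]), (50, 100)),
    ("f3", project(P, ["d", "c"]), (50, 50)),
]:
    offsets, size = layout(fields)
    print(label, offsets, size, dims[0] * dims[1] * size)
```

The computed element sizes (22, 16, and 10 bytes) and array sizes (220000, 80000, and 25000 bytes) agree with the T and A labels in the figure, which supports the packed-layout reading.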

  17. Optimizations
      a. Merging neighboring fields
      b. Replacing TBlock with an SBlock
      [Figure: before and after block trees of ABlocks, TBlocks, and SBlocks for each optimization.]
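The slide shows these two optimizations only as diagrams, so the exact merge criteria are not given. One plausible reading, sketched below on `(src, dst, size)` field copies instead of on the Block tree, is that fields adjacent in both source and destination can be coalesced, and that a struct whose fields merge into a single run covering it can be treated as one simple block. The function names are hypothetical.

```python
def merge_neighbors(copies):
    """Coalesce (src, dst, size) copies contiguous in both source and destination."""
    merged = []
    for src, dst, size in sorted(copies):
        if merged:
            psrc, pdst, psize = merged[-1]
            if psrc + psize == src and pdst + psize == dst:
                merged[-1] = (psrc, pdst, psize + size)
                continue
        merged.append((src, dst, size))
    return merged


def collapses_to_simple(copies, struct_size):
    """True if the merged fields form one run covering the whole struct."""
    return merge_neighbors(copies) == [(0, 0, struct_size)]


# All four fields of a 22-byte struct copied in place: one 22-byte copy.
full = [(0, 0, 8), (8, 8, 4), (12, 12, 8), (20, 20, 2)]
# Only the first and third fields, packed together in the destination.
partial = [(0, 0, 8), (12, 8, 8)]

print(merge_neighbors(full), collapses_to_simple(full, 22))
print(merge_neighbors(partial), collapses_to_simple(partial, 16))
```

Fewer, larger copies matter most inside arrays, where the per-element work is repeated for every index.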

  18. Ceph Integration
      • RADOS Objects: custom object class extension
        • Dataset object
          • metadata: dataset + stripe definitions
          • no data
        • Stripe object
          • partial read/write using transformation rules
          • write triggers updates to secondary replicas
      • Client Side
        • access unit is a fragment
        • server sends back a list of objects and transformation rules
        • client executes the local transformation rules and sends the remote transformation rules (+ data) to the OSDs

  19. Results: MPI Tile I/O
      [Figure: four plots of bandwidth (MB/s) for ASGARD, collective MPI I/O, and non-collective MPI I/O. Top row: Write and Read versus tile size. Bottom row: Write and Read versus number of ranks.]

  20. Results: HPIO Read
      [Figure: four plots of bandwidth (MB/s) versus region count for ASGARD, collective MPI I/O, and non-collective MPI I/O. Panels: Contiguous Memory / Contiguous Storage, Contiguous Memory / Non-contiguous Storage, Non-contiguous Memory / Contiguous Storage, Non-contiguous Memory / Non-contiguous Storage.]

  21. Ceph Bandwidth
      [Figure: three plots of bandwidth (MB/s) versus time (s), one each for ASGARD, collective MPI I/O, and non-collective MPI I/O.]

  22. Ceph Operations
      [Figure: three plots of operations per second versus time (s), one each for ASGARD, collective MPI I/O, and non-collective MPI I/O.]

  23. Conclusions
      • ASGARD defines a language- and library-independent data description
      • Compact transformation rules
      • Small transformation engine (3K LOC) with implementations in Go and C
      • Easy to integrate in storage systems and libraries
      • Questions:
        • is it the right level of data description?
        • does it make sense to push for general file systems?
        • what else did we miss?
        • do we need byte order (LSB, MSB) and/or primary type (IEEE 754, integer)?
