Optimized Scatter/Gather for Parallel Storage

Latchesar Ionkov, Carlos Maltzahn, Michael Lang

PDSW-DISCS 2017

LA-UR-17-2163

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
Los Alamos National Laboratory

HPC Storage: Stuck in the Past


Replacing POSIX is hard

  • Great interface
    • Easy to understand and use
    • Easy to implement almost correctly
  • Not scalable for shared use
  • A lot of unsettled corner cases
  • Made for programmers
  • Scientists don’t care about files
    • they have datasets
    • they have other things to worry about
    • best case — know how data is laid out in memory


Middleware

  • Different (better?) user interface
    • HDF5
    • MPI I/O
    • ADIOS
    • ArrayQL
  • Better performance
    • MPI I/O
    • PLFS
    • DeltaFS
    • GIGA+
  • They all have to deal with POSIX idiosyncrasies


Complete Systems

  • Huge effort
  • Feature creep — even harder to finish
  • Interoperability?


Interfaces are important

  • Simple
  • Not too extensible, not too many knobs
  • Too much freedom is bad; the designer should make the right choices
  • ASGARD tries to be the best interface for something specific
    • the right level of description of data
    • for distributed environments
    • so data can be efficiently gathered from pieces scattered across many nodes
    • language- and library-independent


Fragments

  • Describe part of the dataset
  • Contiguous
  • Can be materialized in memory, or stored on disk


Blocks

  • Fragments consist of Blocks
  • Each Block describes a contiguous region of the fragment
  • Blocks can be connected to Blocks in other fragments
  • Each Block has:
    • offset
    • size
    • list of Blocks
  • Three types of Blocks


SBlock

  • “Simple” Block
  • Properties:
    • offset
    • size
    • list of Blocks (connections, same type and size)
  • Examples:
    • double -> SBlock of size 8
    • uint32_t -> SBlock of size 4


TBlock

  • “sTruct” Block
  • Groups other Blocks (of different sizes)
  • Properties:
    • offset
    • size
    • list of Blocks (fields)
    • list of Blocks (connections, same type)
  • Offsets of the field Blocks are relative to the start of the TBlock
  • Can have holes


ABlock

  • “Array” Block
  • Groups Blocks of the same type and size
  • Properties:
    • offset
    • dimension sizes
    • element order (row-major, column-major, etc.)
    • element Block
    • list of Destinations (connections)
  • Destination:
    • Block
    • (a_i, b_i, c_i, d_i, idx_i) for each dimension

idx_i = (a_i·x_i + b_i) / (c_i·x_i + d_i)


Fragment

  • Fragment
    • Collection of Blocks
    • Top-level Blocks
  • Transformation (src, dest)
    • For each top-level block in src:
      • SBlock — copy to each destination Block ∈ dest
      • TBlock — recursively run for each field (keep offsets)
      • ABlock — for each element with index [x1, x2, …, xn]:
        • calculate offset in src
        • for each destination Block ∈ dest:
          • calculate index [y1, y2, …, yn] in dest
          • calculate offset
          • recursively run transformation for the element Block


Transformation Rules

fragment dataset {
    var p struct {
        a, b, c float32
    }
}

fragment default {
    var p = p
}

fragment viz {
    var pa { a } = p
    var pba { b, a } = p
}

[Figure: block graphs for the default and viz fragments: each field of p is an SBlock (a at 0000, b at 0004, c at 0008) inside a TBlock, with dest links connecting the fields of p to the pa and pba variables]


Fragment Sources

[Figure: fragment source selection across nodes A–J, with per-node coverage fractions such as (A:0.25), (B:0.3), (D:0.4), (E:0.2), (H:1.0), and alternatives like (A:0.25 J:0.25) once node J's data is included]


Transformation Rules

[Figure: nodes A–D each run a remote transformation ("rmt xform") on their local data; the target node T runs a local transformation ("local xform") on each result]


[Figure: block graphs for fragments ds, f1, f2, and f3: the ds TBlock (size 22) holds fields a, b, c, d at offsets 0, 8, 12, 20 as the element of a 100×100 ABlock; dest links with index mappings { i } | { j } and { i±25 } | { j±25 } connect it to the d1, d2, and d3 variables]

fragment ds {
    type P struct {
        a float64
        b float32
        c float64
        d int16
    }
    var data [100, 100] P
}

fragment f1 {
    var d1 = data
}

fragment f2 {
    var d2 { a, c } = data
}

fragment f3 {
    var d3[i, j] { d, c } = data[i-25, j-25]
}


Optimizations

  • a. Merging neighboring fields
  • b. Replacing a TBlock with an SBlock

[Figure: before/after block graphs: adjacent SBlock fields merged inside a TBlock, a TBlock of uniform fields collapsed into a single SBlock, and the same rewrites applied to the element Blocks of ABlocks]


Ceph Integration

  • RADOS objects: custom object class extension
  • Dataset object
    • metadata: dataset + stripe definitions
    • no data
  • Stripe object
    • partial read/write using transformation rules
    • write triggers updates to secondary replicas
  • Client side
    • access unit is the fragment
    • server sends back a list of objects and transformation rules
    • client executes local transformation rules, sends remote transformation rules (+ data) to the OSD
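A hedged sketch of this client-side read flow; every name below (Rule, lookupDataset, remoteTransform, localTransform) is a hypothetical stub standing in for the custom RADOS object-class calls, and only the split between remote and local rules follows the slide.

```go
package main

import "fmt"

// Rule pairs a stripe object with the transformation rules to run on the
// OSD and on the client; an illustrative shape, not Ceph's API.
type Rule struct {
	Object string // stripe object holding part of the fragment
	Remote string // transformation rules executed on the OSD
	Local  string // transformation rules executed on the client
}

// lookupDataset stands in for querying the dataset object, which holds
// only metadata (dataset + stripe definitions), never data.
func lookupDataset(fragment string) []Rule {
	return []Rule{{Object: "stripe.0", Remote: "gather", Local: "scatter"}}
}

// remoteTransform stands in for the partial read that the custom object
// class executes on the OSD.
func remoteTransform(object, rules string) []byte {
	return []byte(object + ":" + rules)
}

// localTransform stands in for placing the returned bytes into the
// caller's buffer using the client-side rules.
func localTransform(buf, data []byte, rules string) {
	copy(buf, data)
}

// readFragment is the access unit from the slide: resolve the fragment,
// run remote rules per stripe object, then finish locally.
func readFragment(fragment string, buf []byte) {
	for _, r := range lookupDataset(fragment) {
		data := remoteTransform(r.Object, r.Remote)
		localTransform(buf, data, r.Local)
	}
}

func main() {
	buf := make([]byte, 15)
	readFragment("f2", buf)
	fmt.Println(string(buf)) // stripe.0:gather
}
```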


Results: MPI Tile I/O

[Figure: MPI Tile I/O write and read bandwidth (MB/s), vs. tile size (500–4500) and vs. number of ranks (10–100), comparing ASGARD, collective MPI I/O, and non-collective MPI I/O]


Results: HPIO Read

[Figure: HPIO read bandwidth (MB/s) vs. region count (10^5 to 10^7) for the four combinations of contiguous/non-contiguous memory and storage, comparing ASGARD, collective MPI I/O, and non-collective MPI I/O]


Ceph Bandwidth

[Figure: Ceph bandwidth (MB/s) over time (s) for ASGARD, collective MPI I/O, and non-collective MPI I/O]


Ceph Operations

[Figure: Ceph operations per second over time (s) for ASGARD, collective MPI I/O, and non-collective MPI I/O]


Conclusions

  • ASGARD defines a language- and library-independent data description
  • Compact transformation rules
  • Small transformation engine (3K LOC) with implementations in Go and C
  • Easy to integrate into storage systems and libraries
  • Questions:
    • is it the right level of data description?
    • does it make sense to push for general file systems?
    • what else did we miss?
    • do we need byte order (LSB, MSB) and/or primary type (IEEE 754, integer)?
