Optimized Scatter/Gather for Parallel Storage

Latchesar Ionkov, Carlos Maltzahn, Michael Lang

PDSW-DISCS 2017

LA-UR-17-2163

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
Los Alamos National Laboratory

HPC Storage: Stuck in the Past


Replacing POSIX is hard

  • Great interface
    • Easy to understand and use
    • Easy to implement almost correctly
  • Not scalable for shared use
  • A lot of unsettled corner cases
  • Made for programmers
  • Scientists don’t care about files
    • they have datasets
    • they have other things to worry about
    • best case — know how data is laid out in memory


Middleware

  • Different (better?) user interface
    • HDF5
    • MPI I/O
    • ADIOS
    • ArrayQL
  • Better performance
    • MPI I/O
    • PLFS
    • DeltaFS
    • GIGA+
  • They all have to deal with POSIX idiosyncrasies


Complete Systems

  • Huge effort
  • Feature creep — even harder to finish
  • Interoperability?


Interfaces are important

  • Simple
  • Not too extensible, not too many knobs
  • Too much freedom is bad; the designer should make the right choices
  • ASGARD tries to be the best interface for something specific
    • the right level of description of data
    • for distributed environments
    • so data can be efficiently gathered from pieces scattered across many nodes
    • language- and library-independent


Fragments

  • Describe part of the dataset
  • Contiguous
  • Can be materialized in memory, or stored on disk


Blocks

  • Fragments consist of Blocks
  • Each Block describes a contiguous region of the fragment
  • Blocks can be connected to Blocks in other fragments
  • Each Block has:
    • offset
    • size
    • list of Blocks
  • Three types of Blocks


SBlock

  • “Simple” Block
  • Properties:
    • offset
    • size
    • list of Blocks (connections, same type and size)
  • Examples:
    • double -> SBlock of size 8
    • uint32_t -> SBlock of size 4


TBlock

  • “sTruct” Block
  • Groups other Blocks (of different sizes)
  • Properties:
    • offset
    • size
    • list of Blocks (fields)
    • list of Blocks (connections, same type)
  • Offsets of the field Blocks are relative to the start of the TBlock
  • Can have holes


ABlock

  • “Array” Block
  • Groups Blocks of the same type and size
  • Properties:
    • offset
    • dimension sizes
    • element order (row-major, column-major, etc.)
    • element Block
    • list of Destinations (connections)
  • Destination:
    • Block
    • (a_i, b_i, c_i, d_i, idx_i) for each dimension

idx_i = (a_i·x_i + b_i) / (c_i·x_i + d_i)


Fragment

  • Fragment
    • Collection of Blocks
    • Top-level Blocks
  • Transformation (src, dest)
    • For each top-level block in src:
      • SBlock — copy to each destination Block ∈ dest
      • TBlock — recursively run for each field (keep offsets)
      • ABlock — for each element with index [x1, x2, …, xn]:
        • calculate offset in src
        • for each destination Block ∈ dest:
          • calculate index [y1, y2, …, yn] in dest
          • calculate offset
          • recursively run transformation for the element Block


Transformation Rules

fragment dataset {
    var p struct {
        a, b, c float32
    }
}

fragment default {
    var p = p
}

fragment viz {
    var pa { a } = p
    var pba { b, a } = p
}

[Figure: block graphs for the default and viz fragments: each field of p is an SBlock (a at 0000, b at 0004, c at 0008) inside a TBlock, with dest links connecting the fields of p to the pa and pba variables]


Fragment Sources

[Figure: fragment source selection across nodes A–J, with per-node coverage fractions such as (A:0.25), (B:0.3), (D:0.4), (E:0.2), (H:1.0), and alternatives like (A:0.25 J:0.25) once node J's data is included]


Transformation Rules

[Figure: nodes A–D each run a remote transformation ("rmt xform") on their local data; the target node T runs a local transformation ("local xform") on each result]


[Figure: block graphs for fragments ds, f1, f2, and f3: the ds TBlock (size 22) holds fields a, b, c, d at offsets 0, 8, 12, 20 as the element of a 100×100 ABlock; dest links with index mappings { i } | { j } and { i±25 } | { j±25 } connect it to the d1, d2, and d3 variables]

fragment ds {
    type P struct {
        a float64
        b float32
        c float64
        d int16
    }
    var data [100, 100] P
}

fragment f1 {
    var d1 = data
}

fragment f2 {
    var d2 { a, c } = data
}

fragment f3 {
    var d3[i, j] { d, c } = data[i-25, j-25]
}


Optimizations

  • a. Merging neighboring fields
  • b. Replacing a TBlock with an SBlock

[Figure: before/after block graphs: adjacent SBlock fields merged inside a TBlock, a TBlock of uniform fields collapsed into a single SBlock, and the same rewrites applied to the element Blocks of ABlocks]


Ceph Integration

  • RADOS objects: custom object class extension
  • Dataset object
    • metadata: dataset + stripe definitions
    • no data
  • Stripe object
    • partial read/write using transformation rules
    • write triggers updates to secondary replicas
  • Client side
    • access unit is the fragment
    • server sends back a list of objects and transformation rules
    • client executes local transformation rules, sends remote transformation rules (+ data) to the OSD
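A hedged sketch of this client-side read flow; every name below (Rule, lookupDataset, remoteTransform, localTransform) is a hypothetical stub standing in for the custom RADOS object-class calls, and only the split between remote and local rules follows the slide.

```go
package main

import "fmt"

// Rule pairs a stripe object with the transformation rules to run on the
// OSD and on the client; an illustrative shape, not Ceph's API.
type Rule struct {
	Object string // stripe object holding part of the fragment
	Remote string // transformation rules executed on the OSD
	Local  string // transformation rules executed on the client
}

// lookupDataset stands in for querying the dataset object, which holds
// only metadata (dataset + stripe definitions), never data.
func lookupDataset(fragment string) []Rule {
	return []Rule{{Object: "stripe.0", Remote: "gather", Local: "scatter"}}
}

// remoteTransform stands in for the partial read that the custom object
// class executes on the OSD.
func remoteTransform(object, rules string) []byte {
	return []byte(object + ":" + rules)
}

// localTransform stands in for placing the returned bytes into the
// caller's buffer using the client-side rules.
func localTransform(buf, data []byte, rules string) {
	copy(buf, data)
}

// readFragment is the access unit from the slide: resolve the fragment,
// run remote rules per stripe object, then finish locally.
func readFragment(fragment string, buf []byte) {
	for _, r := range lookupDataset(fragment) {
		data := remoteTransform(r.Object, r.Remote)
		localTransform(buf, data, r.Local)
	}
}

func main() {
	buf := make([]byte, 15)
	readFragment("f2", buf)
	fmt.Println(string(buf)) // stripe.0:gather
}
```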


Results: MPI Tile I/O

[Figure: MPI Tile I/O write and read bandwidth (MB/s), vs. tile size (500–4500) and vs. number of ranks (10–100), comparing ASGARD, collective MPI I/O, and non-collective MPI I/O]


Results: HPIO Read

[Figure: HPIO read bandwidth (MB/s) vs. region count (10^5 to 10^7) for the four combinations of contiguous/non-contiguous memory and storage, comparing ASGARD, collective MPI I/O, and non-collective MPI I/O]


Ceph Bandwidth

[Figure: Ceph bandwidth (MB/s) over time (s) for ASGARD, collective MPI I/O, and non-collective MPI I/O]


Ceph Operations

[Figure: Ceph operations per second over time (s) for ASGARD, collective MPI I/O, and non-collective MPI I/O]


Conclusions

  • ASGARD defines a language- and library-independent data description
  • Compact transformation rules
  • Small transformation engine (3K LOC) with implementations in Go and C
  • Easy to integrate into storage systems and libraries
  • Questions:
    • is it the right level of data description?
    • does it make sense to push for general file systems?
    • what else did we miss?
    • do we need byte order (LSB, MSB) and/or primary type (IEEE 754, integer)?
