Methodology for the Rapid Development of Scalable HPC Data Services



slide-1
SLIDE 1

Methodology for the Rapid Development of Scalable HPC Data Services

PDSW-DISCS 2018 Dallas, TX

Matthieu Dorier, Philip Carns, Kevin Harms, Robert Latham, Robert Ross, Shane Snyder, Justin Wozniak, Samuel K. Gutiérrez, Bob Robey, Brad Settlemyer, Galen Shipman, Jerome Soumagne, James Kowalkowski, Marc Paterno, Saba Sehrish

1

slide-3
SLIDE 3

New Applications and Systems: Demand for New Services

Data, Simulation, and Learning “pillars”

Top image credit B. Helland (ASCR). Bottom left and right images credit ALCF. Bottom center image credit OLCF.

  • Different application use cases have different data needs
  • “One size fits all” doesn’t work: need customized data services for each to meet mission goals
  • This poses a significant technical challenge: how to enable rapid development of such services (agility) while still preserving performance (efficiency) and production quality (maintainability)

Key idea: address this challenge via composable data services.

3

slide-6
SLIDE 6

Towards reusable components for data services

Parallel File Systems

Advantages

  • Well established
  • Standard interface

Drawbacks

  • Single consistency model
  • Complex to maintain and tune
  • Files often inappropriate

Hand-crafted Data Services

Advantages

  • Tuned for this application
  • Appropriate consistency model
  • Appropriate data model

Drawbacks

  • Difficult to maintain
  • Not reusable
  • Scare users

Composable micro-services

  • Reusable across services
  • Easy to maintain
  • Not so scary to users
  • Adaptable, configurable
  • Can use latest tech

6

slide-10
SLIDE 10

Common capabilities needed by data services

  • Runtime substrate
      • RPC, RDMA
      • Threading/Tasking
  • Core components
      • Bulk storage management
      • Key/Value storage
      • Group membership
      • Diagnostics and monitoring
  • Programmability/Expressiveness
      • Embedded interpreters
      • Wrappers (Python, C++, etc.)
  • Composed services
      • FlameStore
      • HEPnOS
      • SDSDKV

10

slide-11
SLIDE 11

Challenges in Composing HPC Microservices

  • Formalize composition
  • Unify single-process, multi-process, single-node, and multi-node designs
  • Maximize efficient use of resources (network, storage)

11

slide-12
SLIDE 12

Vision

Lowering the barriers to distributed services in computational science.

Approach

  • Familiar models (key/value, object, file)
  • Easy to build, adapt, and deploy
  • Lightweight, user-space components
  • Modern hardware support

Impact

  • Better, more capable services for specific use cases on high-end platforms
  • Significant code reuse
  • Ecosystem for service development

http://www.mcs.anl.gov/research/projects/mochi/

[Diagram: Mochi at the center, drawing on the following areas]

  • HPC: fast transports, scientific data, user-level threads
  • Cloud computing: object stores, key-value stores
  • Distributed computing: group membership, communication
  • Software engineering: composability
  • Autonomics: distributed control, adaptability

slide-13
SLIDE 13

Let’s dive into the methodology

13

slide-14
SLIDE 14

Matching building blocks to user requirements

  • User Requirements: data model, access pattern, guarantees
  • Service Requirements: data organization, metadata organization, user interface
  • Composition and Interfacing: composition glue code, API implementation
  • Building Blocks: runtime, service providers

14

slide-15
SLIDE 15

Identifying application needs

  • Which data model?
      • Arrays, meshes, objects
      • Namespace, metadata
  • Which access pattern?
      • Characteristics (e.g. access sizes)
      • Collective/individual accesses
  • Which guarantees?
      • Consistency
      • Performance
      • Persistence

User Requirements

15

slide-18
SLIDE 18

Service Requirements

Defining service requirements

  • Which data model?
      • Arrays, meshes, objects
      • Namespace, metadata
  • Which access pattern?
      • Characteristics (e.g. access sizes)
      • Collective/individual accesses
  • Which guarantees?
      • Consistency
      • Performance
      • Persistence
  • How should data be organized?
      • Sharding, distribution, replication
  • How should metadata be organized?
      • Distribution, content, indexing
  • How do clients interface with the service?
      • Programming language, API

18

slide-19
SLIDE 19

What do components look like?

19

slide-20
SLIDE 20

Components: engineering challenges

  • How do components share resources (CPU, network, memory) without interfering with one another?
      • Bad approach: each component has its own progress loop
  • How do we leverage massively multi-core nodes to, for instance, assign components to cores, make components efficiently share a core, or prevent components from interfering with network progress?
      • Bad approach: each component manages its own thread(s)
  • How can we support a wide range of networks?
      • Bad approach: reimplement for a new transport every time the code is ported to a new platform
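
To make the first "bad approach" concrete, here is a minimal Python sketch of the alternative: several components register their handlers with one shared runtime that owns the only progress loop. Every name in it (`SharedRuntime`, `BlobComponent`, `KeyValueComponent`, the RPC names) is hypothetical and chosen for illustration. It is a single-process toy, not the Margo, Mercury, or Argobots API.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class SharedRuntime:
    """Toy runtime (hypothetical): one request queue and one progress loop."""

    def __init__(self) -> None:
        self._handlers: Dict[Tuple[str, int], Callable[[str], str]] = {}
        self._queue: Deque[Tuple[str, int, str]] = deque()

    def register(self, rpc_name: str, provider_id: int,
                 handler: Callable[[str], str]) -> None:
        # Components only register handlers; they never poll on their own.
        self._handlers[(rpc_name, provider_id)] = handler

    def submit(self, rpc_name: str, provider_id: int, payload: str) -> None:
        # Stand-in for a request arriving from the network.
        self._queue.append((rpc_name, provider_id, payload))

    def progress(self) -> List[str]:
        # The single progress loop: drain the queue and dispatch each request.
        results = []
        while self._queue:
            rpc_name, provider_id, payload = self._queue.popleft()
            results.append(self._handlers[(rpc_name, provider_id)](payload))
        return results


class BlobComponent:
    """Hypothetical bulk-storage component."""

    def __init__(self, runtime: SharedRuntime, provider_id: int) -> None:
        self.blobs: List[str] = []
        runtime.register("blob_write", provider_id, self._write)

    def _write(self, payload: str) -> str:
        self.blobs.append(payload)
        return f"blob#{len(self.blobs) - 1}"


class KeyValueComponent:
    """Hypothetical key/value component."""

    def __init__(self, runtime: SharedRuntime, provider_id: int) -> None:
        self.data: Dict[str, str] = {}
        runtime.register("kv_put", provider_id, self._put)

    def _put(self, payload: str) -> str:
        key, _, value = payload.partition("=")
        self.data[key] = value
        return f"stored {key}"


runtime = SharedRuntime()
blob = BlobComponent(runtime, provider_id=1)
kv = KeyValueComponent(runtime, provider_id=1)
runtime.submit("kv_put", 1, "run=43")
runtime.submit("blob_write", 1, "raw bytes")
results = runtime.progress()
print(results)  # ['stored run', 'blob#0']
```

Because neither component polls the network itself, adding a third component does not add a third loop competing for the same CPU and network resources.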

20

Building Blocks

slide-23
SLIDE 23

Anatomy of a Mochi component

Building Blocks

23

slide-31
SLIDE 31

Composition made very easy

  • Example composition code in Python
  • (components themselves are programmed in C or C++)

    # some import statements here

    # Initializing Margo runtime (using Cray GNI network for Mercury)
    mid = MargoInstance("gni")

    # BAKE component managing RDMA to storage target
    bake_provider = BakeProvider(mid, 1)
    bake_provider.add_storage_target("/local/ssd/space.dat")

    # SDSKV component managing a database
    sdskv_provider = SDSKVProvider(mid, 1)
    sdskv_provider.add_database("mydatabase", "/tmp/sdskv", pysdskv.server.leveldb)

    mid.wait_for_finalize()

31

slide-32
SLIDE 32

HEPnOS

Fast event-store for High Energy Physics experiments

32

slide-33
SLIDE 33

User Requirements

Storing “products”

  • From experiments
  • From simulations or analysis workflows

Data model

  • Products are instances of C++ objects
  • Hierarchy: datasets, runs, subruns, events
  • Products are labeled by an “input tag”

Access pattern

  • Write-once-read-many
  • Products accessed atomically
  • Access by input tag and by type
  • Iterators to navigate the hierarchy

33

slide-36
SLIDE 36

User Requirements

Envisioned usage

  • Long-running (weeks), resizable cache based on fast, in-compute-node storage (SSDs, NVRAM, local memory)
  • Accessed by multiple applications concurrently
  • Backed up by a more permanent storage system (parallel file system, archive system, object store) when undeployed

36

slide-37
SLIDE 37

Service Requirements

How should objects be distributed?

  • Based on the hash of a “path-like” string
  • <dataset>/<run>/<subrun>/<event>/<input-tag>/<object-type>
  • myproject/mydata%45%23%678#exp1_alpha_std::map<int,Particle>

Should objects be sharded?

  • No

Should objects be replicated?

  • Maybe

How should metadata be managed?

  • Same path-like strings as products
  • Hash is based on the “parent” path in the hierarchy so that all containers belonging to the same parent end up on the same node

What should the API look like?

  • C++ with template metaprogramming to handle storage of any type of object, and iterator constructs to navigate the hierarchy
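
The placement rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HEPnOS implementation: the slides do not name the hash function, so SHA-256 is assumed, and the "/" separator, the node count, and all function names are hypothetical.

```python
import hashlib


def node_for(path: str, num_nodes: int) -> int:
    """Map a path-like string to a node index (SHA-256 is an assumption)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def product_key(dataset: str, run: int, subrun: int, event: int,
                input_tag: str, object_type: str) -> str:
    """Build <dataset>/<run>/<subrun>/<event>/<input-tag>/<object-type>."""
    return "/".join([dataset, str(run), str(subrun), str(event),
                     input_tag, object_type])


def parent_of(container_path: str) -> str:
    """Return the path of the enclosing container."""
    parent, _, _ = container_path.rpartition("/")
    return parent


def node_for_product(key: str, num_nodes: int) -> int:
    # Products are placed by hashing their full key; they are not sharded.
    return node_for(key, num_nodes)


def node_for_container(container_path: str, num_nodes: int) -> int:
    # Metadata is placed by hashing the parent path, so that all
    # containers with the same parent land on the same node.
    return node_for(parent_of(container_path), num_nodes)


NUM_NODES = 4  # hypothetical deployment size
key = product_key("myproject/mydata", 45, 23, 678,
                  "exp1_alpha", "std::map<int,Particle>")
product_node = node_for_product(key, NUM_NODES)

# Two subruns of the same run share a parent, hence a metadata node.
subrun_a = "myproject/mydata/45/23"
subrun_b = "myproject/mydata/45/24"
same_node = (node_for_container(subrun_a, NUM_NODES)
             == node_for_container(subrun_b, NUM_NODES))
print(product_node, same_node)
```

Hashing the parent path means that listing the children of a container touches a single node, which keeps iteration over the hierarchy cheap. The trade-off is that a very large container concentrates its metadata on that one node.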

37

slide-42
SLIDE 42

[Architecture diagram of HEPnOS]

  • Building Blocks: Margo runtime (Mercury + Argobots) with RPC and RDMA; BAKE (PMEM); SDS-KeyVal (LevelDB)
  • Composition and Interfacing: Client, C++ API; Boost, YAML

42

slide-43
SLIDE 43

Code sample of HEPnOS

    #include <vector>
    #include <hepnos.hpp>

    // example structure
    struct Particle {
        float x, y, z; // member variables
        // serialization function for boost to use
        template<typename A>
        void serialize(A& ar, unsigned long version) {
            ar & x & y & z;
        }
    };

    // initialize a handle to the HEPnOS datastore
    hepnos::DataStore datastore("config.yaml");
    // access a nested dataset
    hepnos::DataSet ds = datastore["path/to/dataset"];
    // access run 43 in the dataset
    hepnos::Run run = ds[43];
    // access subrun 56
    hepnos::SubRun subrun = run[56];
    // access event 25
    hepnos::Event ev = subrun[25];
    // store data (an std::vector of Particle)
    std::vector<Particle> vp1 = ...;
    ev.store("mylabel", vp1);
    // load data
    std::vector<Particle> vp2;
    ev.load("mylabel", vp2);
    // iterate over the subruns in a run
    // using a C++ range-based for
    for(auto& subrun : run) { ... }

  • Boost for serialization of C++ classes
  • “map”-like interface in DataStore, DataSet, Run, and SubRun classes

  • Template “load” and “store” methods
  • Iterators to navigate the hierarchy
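
The interface ideas listed above (map-like navigation, typed store/load by label, iteration over children) can be mimicked in a short, in-memory Python sketch. It is not the HEPnOS API: the `Container` class, its create-on-first-access behavior, and the use of `pickle` in place of Boost serialization are all assumptions made for illustration. The write-once check reflects the write-once-read-many access pattern from the user requirements.

```python
import pickle
from typing import Any, Dict, Iterator, Tuple


class Container:
    """Hypothetical node in the hierarchy (dataset, run, subrun, or event)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._children: Dict[Any, "Container"] = {}
        self._products: Dict[Tuple[str, str], bytes] = {}

    def __getitem__(self, key: Any) -> "Container":
        # Map-like access; creating the child on first use is a simplification.
        if key not in self._children:
            self._children[key] = Container(f"{self.path}/{key}")
        return self._children[key]

    def __iter__(self) -> Iterator["Container"]:
        # Iterate over children in key order.
        return iter([self._children[k] for k in sorted(self._children)])

    def store(self, label: str, product: Any) -> None:
        # Products are addressed by label and by type, and written once.
        key = (label, type(product).__name__)
        if key in self._products:
            raise KeyError(f"product {key} already stored in {self.path}")
        self._products[key] = pickle.dumps(product)

    def load(self, label: str, product_type: type) -> Any:
        return pickle.loads(self._products[(label, product_type.__name__)])


datastore = Container("")
event = datastore["path/to/dataset"][43][56][25]
event.store("mylabel", [(1.0, 2.0, 3.0)])
loaded = event.load("mylabel", list)

run = datastore["path/to/dataset"][43]
subrun_paths = [subrun.path for subrun in run]
print(loaded, subrun_paths)
```

Keying products on both label and type is what lets a reader request a product "by input tag and by type" without knowing what else is stored in the same event.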

43

slide-44
SLIDE 44

Taking a step back: other Mochi services

  • FlameStore
      • Python interface, Python composition
      • Stores deep neural networks
      • Flat namespace
      • BAKE storing NumPy arrays
      • SDSKeyVal storing model metadata in JSON format
      • Embedded Python interpreter to modify models within storage
  • SDSDKV
      • C interface, C++ composition
      • Distributed key/value store
      • Used for the ParSplice application (molecular dynamics)

44

slide-45
SLIDE 45

Lightweight: Source Lines of Code (SLOC)

Component     SLOC (Client / Server / Other, in slide order)   External Users

Core
Argobots      15,193                                           Intel, LLNL, Mainz
Mercury       27,979                                           Intel, LBL, LLNL, Mainz
Margo         1,625                                            Intel, LLNL, Mainz
Thallium      3,913
SSG           2,203 + 131 (py-ssg)
MDCS          906

Microservices
SDSKV         1,392 / 2,881 / 234 (py-sdskv)
BAKE          949 / 1,273 / 514 (py-bake)
POESIE        343 / 689

Composed Services
HEPnOS        2,689 / 321                                      FNAL
FlameStore    334 / 438
Mobject       1,498 / 5,044
SDSDKV        407 / 601

45

slide-46
SLIDE 46

Conclusion: use componentization!

  • Monolithic file systems are often suboptimal
  • Data services are better
  • Efficiently building custom data services is a challenge
  • Composing data services is key to productivity

Thank you! Questions?

Personal thanks to the Spack developers who make our lives much easier as we develop these services!

46

This work is in part supported by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357; in part supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative; and in part supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program. This work was done in the context of the DOE SSIO project "Mochi" (https://www.mcs.anl.gov/research/projects/mochi/), a Software Defined Storage Approach to Exascale Storage Services.