Enabling Reproducible Computing on the EPOS ICS-D (presentation by Alessandro Spinuso)



SLIDE 1

Enabling Reproducible Computing on the EPOS ICS-D

Alessandro Spinuso (KNMI), Daniele Bailo (INGV), Jonas Matser (KNMI), Chris Card (BGS), Jean-Baptiste Roquencourt (BRGM), Wayne Shelley (BGS)

SLIDE 2

Computational Earth Science (CES)

European Plate Observing System

Long-term plan to facilitate integrated use of data, data products, and facilities from distributed research infrastructures for solid Earth science in Europe.

ICS-D: the distributed Integrated Core Services element of EPOS.

  • Computational and Data Storage Infrastructures (HPC, Cloud)
  • Services of general interest (data publishing services, external metadata catalogues, AAAI)

SLIDE 3
CES: Earthquake Simulation VREs (portal.verce.eu)

  • Earthquake Simulation: produce synthetic seismograms for Earth models and earthquakes by running HPC simulation software (SPECFEM3D / SPECFEM3D_GLOBE).
  • Data Processing & Misfit Analysis: observed data and synthetics are processed and compared via data-intensive methods. Data is accessed from federated international archives (FDSN), indexed, and reused.

SLIDE 4

Traceable and Combinable Computations across Workspaces (portal.verce.eu)

[Diagram: computations spanning ICS-D HPC (SCAI) and ICS-D Cloud, traced with the provenance models S-PROV, ProvONE, and PROV-O]

SLIDE 5

CES embedded into the EPOS portal workspaces

Distributed data discovery through the ICS-C Catalogue: from data discovery to analysis in dedicated workspaces (spatial integration, temporal integration, processing).

SLIDE 6
CES embedded into the EPOS portal workspaces

A researcher wants to:

  • stage the distributed raw data onto a computational environment to develop/apply custom methods
  • apply preprocessing workflows to the raw data before custom analysis
  • be informed about libraries that fit the selected data, and use them
  • update the raw data that is already in the computational environment
  • keep old versions of raw data (reproducibility / comparison)
  • archive the state of their environment (track progress / restore)

[Diagram: Data Resource A and Data Resource B staged into a Processing Workspace by a workflow for data staging & preprocessing]
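The "update raw data" and "keep old versions" requirements above suggest an append-only staging history, where each staging run adds a new immutable version instead of overwriting. A minimal sketch of that idea, with illustrative names only (not the EPOS implementation):

```python
# Hypothetical append-only staging history: every staging run records a new
# immutable version of the raw data; older versions stay addressable for
# reproducibility and comparison.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class StagingHistory:
    versions: list = field(default_factory=list)  # oldest first

    def stage(self, source_url: str, checksum: str) -> int:
        """Record a newly staged version; returns its version index."""
        self.versions.append({
            "source": source_url,
            "checksum": checksum,
            "staged_at": datetime.now(timezone.utc).isoformat(),
        })
        return len(self.versions) - 1

    def latest(self) -> dict:
        return self.versions[-1]

    def version(self, index: int) -> dict:
        """Retrieve an older staged version by index."""
        return self.versions[index]
```

Restaging the same source after the remote archive changes then yields a new version while the old one remains available for comparison.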

SLIDE 7

Notebook Service, architecture and requirements

[Diagram: ICS-D hosting Notebook Containers and Workflow Containers behind AAAI contextualisation; workers run a data-staging & preprocessing workflow over a shared raw data volume, producing results, notebook pages, and library requirements, all managed through the Notebook Service API]

  • Read-only and extensible input data (staging_history).
  • Workflow and notebook container(s) share volumes (Workflow as a Service).
  • Libraries selectable from the EPOS ICS catalogue.
  • Users' data volumes archived on demand together with notebook pages and library requirements (snapshot).
  • Controlled by the EPOS GUI through a dedicated API.
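The snapshot requirement, archiving a data volume together with notebook pages and library requirements, can be sketched with standard-library tooling. This is an illustrative sketch of the idea only, not the actual service; paths and archive layout are assumptions:

```python
# Illustrative workspace snapshot: bundle the data volume, the notebook
# pages and the library requirements into one archive that can be restored
# later (not the EPOS Notebook Service implementation).
import tarfile
from pathlib import Path

def snapshot(volume: Path, notebooks: Path, requirements: Path, out: Path) -> Path:
    """Archive the three ingredients of a workspace snapshot into out."""
    with tarfile.open(out, "w:gz") as tar:
        tar.add(volume, arcname="data_volume")
        tar.add(notebooks, arcname="notebooks")
        tar.add(requirements, arcname="requirements.txt")
    return out

def restore(archive: Path, target: Path) -> None:
    """Unpack a snapshot into a fresh target directory."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
```

Keeping the library requirements inside the archive is what lets a restored environment be rebuilt with the same dependencies.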

SLIDE 8

Notebook Service, architecture and requirements (continued)

Similar systems we learn from: services that build and run Docker images from GitHub repositories containing notebook pages, with environment version control.

SLIDE 9

Notebook Service API Specification

  • Creation and management of notebook instances and their snapshots.
  • Upload and execution of workflows (data staging, on-demand preprocessing).
  • Workflow runs are associated with an active notebook through a notebookID.
  • API implemented adopting REST verbs to manage workflows, notebooks, snapshots, and runs.
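The resource model behind these verbs can be sketched in a few lines. The class, route comments, and field names below are illustrative assumptions, not the actual service's API:

```python
# Minimal in-memory model of the Notebook Service resources. The REST paths
# in the comments are illustrative, not the service's actual routes.
import uuid

class NotebookService:
    def __init__(self):
        self.notebooks, self.snapshots, self.runs = {}, {}, {}

    def create_notebook(self, user: str) -> str:         # POST /notebooks
        nid = uuid.uuid4().hex
        self.notebooks[nid] = {"user": user, "status": "active"}
        return nid

    def create_snapshot(self, notebook_id: str) -> str:  # POST /notebooks/{id}/snapshots
        sid = uuid.uuid4().hex
        self.snapshots[sid] = {"notebookID": notebook_id}
        return sid

    def submit_run(self, notebook_id: str, workflow: str) -> str:  # POST /runs
        # Every workflow run is tied to an active notebook via its notebookID.
        assert self.notebooks[notebook_id]["status"] == "active"
        rid = uuid.uuid4().hex
        self.runs[rid] = {"notebookID": notebook_id, "workflow": workflow}
        return rid
```

The key constraint from the slide is visible in `submit_run`: a workflow run cannot exist on its own, it always carries the notebookID of an active notebook.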
SLIDE 10

Workflows Role in EPOS

Objective: performing routine operations as well as custom computations (at scale).

Technology:

  • Common Workflow Language for portable and scalable descriptions (cwltool)
  • Dispel4py, a Python-based workflow library:
    ○ parallel streaming computational API
    ○ multiple mappings (HPC, Cloud, multiprocessing)
    ○ customisable provenance capture and semantic contextualisation
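The parallel streaming idea can be illustrated with a plain-Python sketch of chained processing elements (PEs), where each stage consumes items from upstream and yields downstream so data flows through without being materialised in full. This mimics the style of a streaming workflow but is not dispel4py's actual API:

```python
# Plain-Python sketch of a streaming pipeline of processing elements (PEs).
# Each PE is a generator: it consumes items from upstream and yields results
# downstream one at a time (illustrative, not dispel4py's API).
def read_traces(stream_ids):
    for sid in stream_ids:                 # producer PE
        yield {"id": sid, "samples": [1.0, 2.0, 3.0]}

def detrend(records):
    for rec in records:                    # stateless transform PE
        mean = sum(rec["samples"]) / len(rec["samples"])
        yield {**rec, "samples": [s - mean for s in rec["samples"]]}

def max_amplitude(records):
    for rec in records:                    # terminal PE
        yield (rec["id"], max(abs(s) for s in rec["samples"]))

# Composing PEs mirrors connecting nodes in a workflow graph; a mapping
# layer could run each stage in a separate process or HPC rank.
pipeline = max_amplitude(detrend(read_traces(["ST1", "ST2"])))
results = dict(pipeline)
```

Because each stage is a generator, the same graph description can in principle be mapped to multiprocessing or distributed back-ends, which is the portability the slide claims for dispel4py.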

SLIDE 11

Data-Intensive Provenance Model: Streaming Stateful Operators (S-PROV)

[Diagram: agents at two levels, the abstract workflow & user's context and the concrete software actors]

SLIDE 12

Data-Intensive Provenance Model: Streaming Stateful Operators (S-PROV)

[Diagram: agents (abstract workflow & user's context; concrete software actors), state, and the actors' I/O data and metadata]

Multilayered provenance: semantic clustering, process delegation, resource mapping.

SLIDE 13

Data-Intensive Provenance Model (continued)

Further extension: notebook snapshots' dependencies and users' configurations.

SLIDE 14

S-ProvFlow: Data-Intensive Provenance as a Service
https://github.com/KNMI/s-provenance

API methods:
  • Provenance acquisition (bulk).
  • Monitoring and lineage queries.
  • Contextual metadata discovery.
  • Comprehensive summaries.
  • Export to PROV formats (PROV-XML, RDF/Turtle).
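A lineage query over such provenance records amounts to a transitive walk over derivation links. A toy sketch of that idea, where the record layout is an assumption for illustration and not S-PROV's schema:

```python
# Toy lineage store: each data item lists the items it was directly derived
# from (the record shape is illustrative, not the S-PROV schema).
derivations = {
    "misfit.json":    ["synthetic.sac", "observed.sac"],
    "synthetic.sac":  ["earth_model.bin"],
    "observed.sac":   [],
    "earth_model.bin": [],
}

def lineage(item, store):
    """Return every ancestor an item was (transitively) derived from."""
    seen = set()
    stack = list(store.get(item, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(store.get(parent, []))
    return seen
```

A query like `lineage("misfit.json", derivations)` traces a misfit result back through the synthetics to the Earth model it depends on, which is the traceability the service exposes.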


SLIDE 16

Conclusions

  • Balance between automation and user control in coupling data discovery and processing.
  • Exploiting containerised software and infrastructures, integrating workflows and notebooks associated with EPOS workspaces.
  • Workflows as a Service (WaaS) for routine operations; provenance and contextual metadata for validation and traceability.
  • Reproducibility mechanisms (in progress):
    ○ resilience to changes at remote data providers (staging_history)
    ○ on-demand archiving and restore of intermediate progress (snapshots)