Enabling Reproducible Computing on the EPOS ICS-D Alessandro Spinuso - - PowerPoint PPT Presentation
Enabling Reproducible Computing on the EPOS ICS-D
Alessandro Spinuso (KNMI), Daniele Bailo (INGV), Jonas Matser (KNMI), Chris Card (BGS), Jean-Baptiste Roquencourt (BRGM), Wayne Shelley (BGS)
Computational Earth Science (CES)
European Plate Observing System
Long-term plan to facilitate integrated use of data, data products, and facilities from distributed research infrastructures for solid Earth science in Europe.
ICS-D: the distributed Integrated Core Services element of EPOS.
- Computational and Data Storage Infrastructures (HPC, Cloud).
- Services of general interest (data publishing services, external metadata catalogues, AAAI).
- Earthquake Simulation: produce synthetic seismograms for Earth models and earthquakes via the execution of HPC simulation software (SPECFEM3D, SPECFEM3D_GLOBE).
- Data Processing & Misfit Analysis: observed data and synthetics are processed and compared via data-intensive methods. Data is accessed from federated international archives (FDSN), indexed, and reused.
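As an illustration of the misfit-analysis step, a minimal sketch of a least-squares waveform misfit between an observed and a synthetic trace (a toy formulation for illustration, not the actual CES implementation):

```python
def l2_misfit(observed, synthetic):
    """Least-squares (L2) waveform misfit between two equally sampled traces."""
    if len(observed) != len(synthetic):
        raise ValueError("traces must have the same number of samples")
    return sum((o - s) ** 2 for o, s in zip(observed, synthetic)) ** 0.5

# A synthetic identical to the observation has zero misfit:
obs = [0.0, 1.0, 0.5, -0.5]
print(l2_misfit(obs, obs))                    # 0.0
print(l2_misfit(obs, [0.0, 0.0, 0.0, 0.0]))  # sqrt(1.5) ~ 1.2247
```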
CES: Earthquake Simulation VREs (portal.verce.eu)
HPC
S-PROV ProvONE PROV-O
Traceable and Combinable Computations across Workspaces (portal.verce.eu)
ICS-D HPC (SCAI) ICS-D Cloud
CES embedded into the EPOS portal workspaces
Distributed Data Discovery through ICS-C Catalogue
From data discovery to analysis in dedicated workspaces: spatial integration, temporal integration, processing.
A researcher wants to:
- stage the distributed raw data onto a computational environment to develop/apply custom methods.
- apply preprocessing workflows to the raw data before custom analysis.
- be informed about libraries that fit the selected data and use them.
- update the raw data that is already in the computational environment.
- keep old versions of raw data (reproducibility / comparison).
- archive the state of their environment (track progress / restore).
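The "keep old versions" requirement can be sketched as a staging history that retains every staged version of a raw-data item (names and shape are hypothetical; the actual EPOS staging_history mechanism is not specified here):

```python
class StagingHistory:
    """Keeps every staged version of each raw-data item, so older
    versions remain available for reproducibility and comparison."""

    def __init__(self):
        self._versions = {}  # item_id -> list of staged payloads

    def stage(self, item_id, payload):
        """Stage a (new version of a) raw-data item; return its version number."""
        self._versions.setdefault(item_id, []).append(payload)
        return len(self._versions[item_id]) - 1

    def get(self, item_id, version=-1):
        """Fetch a specific version; defaults to the latest."""
        return self._versions[item_id][version]

history = StagingHistory()
history.stage("GE.APE..BHZ", b"raw v1")
history.stage("GE.APE..BHZ", b"raw v2")  # an update keeps v1 around
print(history.get("GE.APE..BHZ", 0))     # b'raw v1'
print(history.get("GE.APE..BHZ"))        # b'raw v2'
```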
CES embedded into the EPOS portal workspaces
[Diagram: Data Resource A and Data Resource B feed a Processing Workspace through a workflow for data-staging & preprocessing.]
Notebook Service, architecture and requirements
[Diagram: within the ICS-D (AAAI, contextualisation), workflow containers (workers running the data-staging & preprocessing workflow) and notebook containers share the Raw Data Volume and Results; notebook pages and library requirements are managed through the Notebook Service API.]
- Read-only and extensible input data (staging_history).
- Workflow and Notebook container(s) share Volumes (Workflow as a Service).
- Libraries selectable from the EPOS ICS catalogue.
- Users' Data Volumes archived on demand with notebook pages and library requirements (snapshot).
- Controlled by the EPOS GUI through a dedicated API.
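A minimal sketch of the on-demand snapshot idea: archive a user's data volume together with notebook pages and library requirements into one restorable artefact (the file layout is a hypothetical example, not the real service's format):

```python
import pathlib
import tarfile
import tempfile

def snapshot(volume_dir, archive_path):
    """Archive a user's data volume (notebook pages, library
    requirements, raw data) into a single restorable tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(volume_dir, arcname="snapshot")
    return archive_path

# Build a toy "volume" and snapshot it.
vol = pathlib.Path(tempfile.mkdtemp())
(vol / "notebook.ipynb").write_text("{}")
(vol / "requirements.txt").write_text("obspy\n")
(vol / "raw").mkdir()
(vol / "raw" / "trace.mseed").write_bytes(b"...")

out = snapshot(vol, vol.parent / "snap.tar.gz")
with tarfile.open(out) as tar:
    names = tar.getnames()
print(sorted(names))
```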
Similar systems we learn from: build and run Docker images from GitHub repositories with notebook pages; environment version control.
Notebook Service API Specification
- Creation and management of notebook instances and their snapshots.
- Upload and execution of workflows (data-staging, on-demand preprocessing).
- Workflow runs are associated with an active notebook through a notebookID.
- API implemented using REST verbs to manage workflows, notebooks, snapshots and runs.
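A sketch of how such a REST layout might look in practice; the endpoint paths and verb mapping below are illustrative assumptions, not the published EPOS specification:

```python
# Hypothetical endpoint layout for the Notebook Service API.
def api_request(verb, resource, resource_id=None, sub=None):
    """Build the (verb, path) pair for a Notebook Service call."""
    path = f"/notebookservice/{resource}"
    if resource_id is not None:
        path += f"/{resource_id}"
    if sub is not None:
        path += f"/{sub}"
    return verb, path

# Create a notebook, snapshot it, run a workflow, tear down.
print(api_request("POST", "notebooks"))
print(api_request("POST", "notebooks", "nb-42", "snapshots"))
print(api_request("POST", "workflows", "staging", "runs"))
print(api_request("DELETE", "notebooks", "nb-42"))
```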
Workflows' Role in EPOS
Objective: performing routine operations as well as custom computations (at scale).
Technology:
- Common Workflow Language for portable and scalable descriptions (cwltool).
- Dispel4py, a Python-based workflow system:
  - parallel streaming computational API;
  - multiple mappings (HPC, Cloud, multiprocessing);
  - customisable provenance capture and semantic contextualisation.
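The streaming model can be illustrated with a plain-Python sketch, with generators standing in for processing elements (this mimics the concept only; it is not the dispel4py API itself):

```python
def read_traces(traces):
    """Source PE: emit one trace at a time (streaming)."""
    for t in traces:
        yield t

def detrend(stream):
    """Stateless PE: remove the mean offset from each trace."""
    for t in stream:
        mean = sum(t) / len(t)
        yield [x - mean for x in t]

def peak_amplitude(stream):
    """Stateful-style PE: track the running maximum amplitude."""
    peak = 0.0
    for t in stream:
        peak = max(peak, max(abs(x) for x in t))
        yield peak

# Compose the pipeline; each trace flows through as it is produced.
traces = [[1.0, 2.0, 3.0], [10.0, 10.0, 16.0]]
pipeline = peak_amplitude(detrend(read_traces(traces)))
print(list(pipeline))  # [1.0, 4.0]
```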
Data-Intensive Provenance Model: Streaming Stateful Operators (S-PROV)
[Diagram: agents (abstract workflow & user's context; concrete software actors), actors' state, I/O data and metadata; multilayered provenance with semantic clustering, process delegation and resource mapping.]
Data-Intensive Provenance Model: Streaming Stateful Operators (S-PROV)
Agents (Abstract Workflow & User’s Context) Agents (Concrete Software Actors)
State
Actors I/O Data and Metadata
Multilayered provenance Semantic Clustering Process Delegation Resource Mapping
Data-Intensive Provenance Model: Streaming Stateful Operators (S-PROV)
Further Extension: notebook snapshots’ dependencies and users’ configurations.
S-ProvFlow: data-intensive provenance as a service (https://github.com/KNMI/s-provenance)
API methods: provenance acquisition (bulk), monitoring and lineage queries, contextual metadata discovery, comprehensive summaries, export to PROV formats (PROV-XML, RDF Turtle).
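A lineage query over bulk provenance records can be sketched as a traversal of "wasDerivedFrom" links; the record shape below loosely follows W3C PROV and is a hypothetical simplification, not the actual S-ProvFlow schema:

```python
def lineage(data_id, derived_from):
    """Walk 'wasDerivedFrom' links back to the original inputs."""
    chain = [data_id]
    while data_id in derived_from:
        data_id = derived_from[data_id]
        chain.append(data_id)
    return chain

# Bulk-acquired provenance: each data product points at its source.
derived_from = {
    "misfit:001": "processed:001",
    "processed:001": "raw:GE.APE",
}
print(lineage("misfit:001", derived_from))
# ['misfit:001', 'processed:001', 'raw:GE.APE']
```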
Conclusions
- Balance between automation and user control in coupling data discovery and processing.
- Exploiting containerised software and infrastructures, integrating workflows and notebooks associated with EPOS workspaces.
- Workflows as a Service (WaaS) for routine operations.
- Provenance and contextual metadata for validation and traceability.
- Reproducibility mechanisms (in progress):
  - resilience to changes at remote data providers (staging_history);
  - on-demand archiving and restore of intermediate progress (snapshots).