Reproducibility in the Cloud: Rawaa Qasha, Jacek Cała, Paul Watson



SLIDE 1

A Framework for Scientific Workflow Reproducibility in the Cloud

Rawaa Qasha, Jacek Cała, Paul Watson

Newcastle University, Newcastle upon Tyne, UK Email: {r.qasha, jacek.cala, paul.watson}@newcastle.ac.uk

SLIDE 2

In this paper

  • A new framework for repeatability and reproducibility of scientific workflows
  • Integrating logical and physical preservation approaches
  • Offering workflow/task repositories with version control
  • Supporting automatic deployment and image capture of workflows and tasks

SLIDE 3

Outline

  • Background
  • Challenges for workflow reproducibility
  • Our solution for logical and physical preservation
  • Overview of the reproducibility framework
  • Experiments and results
  • Conclusions

SLIDE 4

Workflows & Reproducibility

[Bar chart: number of workflows per study, total vs. workflows that can be re-executed]

Study       Total no. of workflows   Workflows that can be re-executed
study1*     92                       18 (~20%)
study2**    1443                     341 (~24%)

*Zhao et al., “Why workflows break: Understanding and combating decay in Taverna workflows,” 2012

**Mayer et al., “A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows,” 2015
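
The percentages on this slide follow directly from the workflow counts. A minimal check, using only the numbers shown in the chart (the dictionary keys and function name are illustrative, not from the slides):

```python
# Counts taken from the chart: (total workflows, workflows that could be re-executed).
studies = {
    "study1 (Zhao et al., 2012)": (92, 18),
    "study2 (Mayer et al., 2015)": (1443, 341),
}


def reexecutable_share(total: int, reexecuted: int) -> float:
    """Return the percentage of workflows that could be re-executed."""
    return 100.0 * reexecuted / total


shares = {name: reexecutable_share(*counts) for name, counts in studies.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}% re-executable")
```

Both shares round to the approximate values given on the slide (~20% and ~24%): in either study, roughly three out of four shared workflows could not be re-executed.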

SLIDE 5

Challenges for workflow reproducibility

  • Insufficiently detailed workflow description
  • Insufficient description of the execution environment
  • Unavailable execution environments
  • Absence of, and changes in, the external dependencies
  • Missing input data

SLIDE 6

Common reproducibility approaches

[Figure: a workflow of tasks T1, T2, T3 and T4, with two preservation approaches: logical preservation and physical preservation]

SLIDE 7

Using TOSCA for logical preservation

Workflow and execution environment description

[Figure: workflow tasks T1, T2, T3 and T4 described with TOSCA elements: Node Type, Relationship Type, and one Node Template per task (T1 to T4), all within a Service Template]
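
To make the idea of a service template with one node template per task more concrete, here is a minimal sketch in Python. It is not TOSCA YAML and not the Cloudify API: the type names, the dictionary layout, and in particular the dependency links between T1 to T4 are assumptions for illustration, since the slide does not spell out how the tasks are connected.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical relationship type name, used only in this sketch.
DEPENDS_ON = "example.relationships.depends_on"

# Hypothetical service template: one node template per workflow task.
# The links between tasks (T1 -> T2, T1 -> T3, T2/T3 -> T4) are assumed.
service_template = {
    "node_templates": {
        "T1": {"type": "example.nodes.WorkflowTask", "relationships": []},
        "T2": {
            "type": "example.nodes.WorkflowTask",
            "relationships": [{"type": DEPENDS_ON, "target": "T1"}],
        },
        "T3": {
            "type": "example.nodes.WorkflowTask",
            "relationships": [{"type": DEPENDS_ON, "target": "T1"}],
        },
        "T4": {
            "type": "example.nodes.WorkflowTask",
            "relationships": [
                {"type": DEPENDS_ON, "target": "T2"},
                {"type": DEPENDS_ON, "target": "T3"},
            ],
        },
    }
}


def deployment_order(template: dict) -> list[str]:
    """Order node templates so every task comes after the tasks it depends on."""
    graph = {
        name: {rel["target"] for rel in node["relationships"]}
        for name, node in template["node_templates"].items()
    }
    return list(TopologicalSorter(graph).static_order())


order = deployment_order(service_template)
print(order)
```

Because the description captures both the tasks and their relationships, a runtime can derive a valid deployment and enactment order from the template alone, which is the point of logical preservation.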

SLIDE 8

Using Docker for physical preservation

Preserving execution environment and dependencies, tracking changes

[Figure, panel (a) Initial task deployment & execution; labels: base image, container creation, task artifact, tools & libs., data, container with dependencies, image creation, task image]

[Figure, panel (b) Task deployment & execution with task image; labels: task image, container creation, data]
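
The two deployment paths in panels (a) and (b) can be sketched as a small planning function. This is a hedged illustration, not the framework's own lifecycle scripts: the function name, the image and container names, and the shell commands are all hypothetical, and the function only builds `docker` command lines without running them. Staging of input data is left out.

```python
def plan_task_deployment(
    task: str,
    base_image: str,
    task_image: str,
    install_cmd: str,
    run_cmd: str,
    image_available: bool,
) -> list[list[str]]:
    """Return docker CLI calls for one task as argument lists (nothing is executed)."""
    container = f"{task}-container"
    if image_available:
        # Path (b): a task image already exists, so the task runs directly from it.
        return [["docker", "run", "--name", container, task_image, "sh", "-c", run_cmd]]
    # Path (a): start from the base image, install tools and libraries, run the task,
    # then capture the resulting container as a reusable task image.
    return [
        ["docker", "run", "--name", container, base_image,
         "sh", "-c", f"{install_cmd} && {run_cmd}"],
        ["docker", "commit", container, task_image],
    ]


# Hypothetical names and commands, for illustration only.
first_run = plan_task_deployment(
    "t1", "ubuntu:22.04", "example/t1-task",
    "apt-get update", "echo run-task", image_available=False,
)
later_run = plan_task_deployment(
    "t1", "ubuntu:22.04", "example/t1-task",
    "apt-get update", "echo run-task", image_available=True,
)

for cmd in first_run + later_run:
    print(" ".join(cmd))
```

The first run pays the cost of installing dependencies once and captures the result; later runs skip the installation step and start from the captured task image.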

SLIDE 9

Reproducibility Framework

[Architecture figure; components shown:]

  • Core Repository (GitHub)
  • Lifecycle scripts and basic types
  • Task/WF Repository (GitHub)
  • Images Repository (Docker Hub)
  • Workflow Deployment & Enactment Engine (TOSCA Runtime Environment: Cloudify)
  • Automated Image Creation
  • Target Execution Environment (Docker over local VM, AWS, Azure, GCE, …)

SLIDE 10

Multi-container deployment

SLIDE 11

Single container deployment

SLIDE 12

Timeline of workflow DevOps

SLIDE 13

Workflow repository

Preserving description, input data and deployment instructions, tracking changes

SLIDE 14

Experiments and Results

SLIDE 15

1. Repeatability of a workflow on different clouds

SLIDE 16

2. Automatic image capture for improved performance

SLIDE 17

3. Reproducibility in the face of development changes

SLIDE 18

Conclusions

  • Full workflow reproducibility is a long-standing issue
  • A TOSCA description is used for logical preservation
  • Docker images for tasks/workflows support physical preservation
  • Change tracking and automatic deployment also contribute to a comprehensive solution to the problem
  • Integrating these techniques addresses the majority of the issues related to workflow decay

SLIDE 19

THANK YOU