Webinars series Live demonstrations on the e-infrastructure - - PowerPoint PPT Presentation

SLIDE 1

www.openrisknet.org

Webinars series

Live demonstrations on the e-infrastructure deployment and the risk assessment case studies

Topic: Date & Time

Past events

  • Introduction sessions to the OpenRiskNet e-infrastructure: see webinar recordings for Session 1 (24 Sep 2018), Session 2 (27 Sept 2018), Session 3 (4 Oct 2018) and Session 4 (30 Oct 2018)
  • Learn how to deploy the OpenRiskNet virtual research environment: see webinar recordings (25 Feb 2019)
  • Demonstration on data curation and creation of pre-reasoned datasets in the OpenRiskNet framework: Monday, 18 March 2019 16:00 CET
  • Identification and linking of data related to AOPWiki (an OpenRiskNet case study): Tuesday, 26 March 2019 17:00 CET
  • Semantic annotation: Monday, 1 April 2019 16:00 CET
  • The Adverse Outcome Pathway Database (AOP-DB): Monday, 8 April 2019 16:00 CET

Current event

  • Nextflow and TGX case study: Monday, 27 May 2019

https://openrisknet.org/events/

SLIDE 2

OpenRiskNet: Open e-Infrastructure to Support Data Sharing, Knowledge Integration and in silico Analysis and Modelling in Risk Assessment

Project Number 731075

Nextflow for toxicogenomics-based predictions on the OpenRiskNet Virtual Research Infrastructure

Evan Floden (Centre for Genomic Regulation)

Webinar - 27 May 2019

SLIDE 3

About the project

OpenRiskNet is a 3-year EU Horizon 2020 project with the main objective to develop an open e-infrastructure providing resources and services to a variety of communities requiring risk assessment, including chemicals, cosmetic ingredients, therapeutic agents and nanomaterials.

Main components:

  • Case-study-driven development: examples of tools to be integrated are selected based on the case study needs. More information: https://openrisknet.org/e-infrastructure/development/case-studies/
  • Solutions for all areas by integrating existing tools from consortium and associated partners (via the implementation challenge)
  • Integrated approach combining experimental data (in vivo, in vitro, in chemico) with analysis, modelling and simulation tools into risk assessment workflows

SLIDE 5

The OpenRiskNet VE

https://prod.openrisknet.org/

SLIDE 6

The OpenRiskNet VE

https://github.com/OpenRiskNet/home/wiki

SLIDE 7

What is Toxicogenomics?

SLIDE 8

External Compute Resources

SLIDE 9

External Data Resources

  • https://ewels.github.io/sra-explorer/
  • https://ewels.github.io/AWS-iGenomes/

SLIDE 10

Portable Computation Virtual Infrastructure Application

SLIDE 11

Scientific Workflow Managers

SLIDE 12

Toxicogenomic workflows

  • Data analysis applications perform computation to generate information from large genomic datasets (resource requirements)
  • Embarrassingly parallel: can spawn 100s to 100k jobs over a distributed cluster
  • Mash-up of many different tools and scripts (dependencies!)
  • Complex dependency trees and configuration → very fragile ecosystem

SLIDE 13

Steinbiss et al., Companion parasite genome annotation pipeline, DOI: 10.1093/nar/gkw292

SLIDE 14

a lot of moving parts

  • 70 tasks
  • 55 external scripts
  • 39 software tools & libraries

SLIDE 15

To reproduce the result of a typical computational biology paper requires 280 hours. ≈1.7 months!

SLIDE 16

* Di Tommaso P, et al., Nextflow enables computational reproducibility, Nature Biotech, 2017

SLIDE 17

Platform                             Amazon Linux  Debian Linux  Mac OSX
Number of chromosomes                36            36            36
Overall length (bp)                  32,032,223    32,032,223    32,032,223
Number of genes                      7,781         7,783         7,771
Gene density                         236.64        236.64        236.32
Number of coding genes               7,580         7,580         7,570
Average coding length (bp)           1,764         1,764         1,762
Number of genes with multiple CDS    113           113           111
Number of genes with known function  4,147         4,147         4,142
Number of t-RNAs                     88            90            88

Comparison of the Companion pipeline annotation of Leishmania infantum genome executed across different platforms *

* Di Tommaso P, et al., Nextflow enables computational reproducibility, Nature Biotech, 2017

SLIDE 18

challenges for risk assessment entering into the omics era

  • Reproducibility
  • Portability
  • Scalability
  • Usability
  • Traceability

SLIDE 19

PUSH-THE-BUTTON PIPELINES

SLIDE 20

The fundamentals for scalable genomic workflows

[Figure labels: orchestration; dependencies; sharing & reproducibility; Git, GitHub; deployment; code]

SLIDE 21

how to achieve this?

  • Fast prototyping ⇒ custom DSL that enables task composition and simplifies most use cases + general-purpose programming language for corner cases
  • Easy parallelisation ⇒ declarative reactive programming model based on the dataflow paradigm, implicit portable parallelism
  • Self-contained ⇒ functional approach: a task execution is idempotent, i.e. it cannot modify the state of other tasks + isolate dependencies with containers
  • Portable deployments ⇒ executor abstraction layer + deployment configuration separated from implementation logic

SLIDE 22

task example

    bwa mem reference.fa sample.fq \
        | samtools sort -o sample.bam

SLIDE 23

task example

    // Wrap the alignment command in a Nextflow process
    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
            | samtools sort -o sample.bam
        """
    }

SLIDE 24

tasks composition

    // Consumes bam_ch, so it runs after align_sample has produced a BAM file
    process index_sample {
        input:
        file 'sample.bam' from bam_ch

        output:
        file 'sample.bai' into bai_ch

        script:
        """
        samtools index sample.bam
        """
    }

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
            | samtools sort -o sample.bam
        """
    }

SLIDE 25

dataflow programming model

  • Declarative computational model for parallel process executions
  • Processes wait for data; when an input set is ready, the process is executed
  • They communicate by using dataflow variables, i.e. async FIFO queues called channels
  • Parallelisation and task dependencies are implicitly defined by process in/out declarations
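The model described in the bullets above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Nextflow code or its implementation: the `process` and `drain` helpers, the `STOP` sentinel and the string-renaming "tasks" are all hypothetical stand-ins. Each process blocks on its input channel, and the ordering between the two steps comes only from the channel they share.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel marking the end of a channel


def process(task, in_ch, out_ch):
    """Run `task` on every item arriving on in_ch and forward results to out_ch."""
    def worker():
        while True:
            item = in_ch.get()        # block until an input is ready
            if item is STOP:
                out_ch.put(STOP)      # propagate end-of-stream downstream
                return
            out_ch.put(task(item))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def drain(ch):
    """Collect every item from a channel until the STOP sentinel arrives."""
    items = []
    while True:
        item = ch.get()
        if item is STOP:
            return items
        items.append(item)


# Channels are asynchronous FIFO queues.
reads_ch, bam_ch, bai_ch = queue.Queue(), queue.Queue(), queue.Queue()

# The dependency "align, then index" is implied only by the shared channel bam_ch.
threads = [
    process(lambda fq: fq.replace(".fq", ".bam"), reads_ch, bam_ch),
    process(lambda bam: bam + ".bai", bam_ch, bai_ch),
]

for name in ["gut.fq", "lung.fq", "liver.fq"]:
    reads_ch.put(name)
reads_ch.put(STOP)

results = drain(bai_ch)
for t in threads:
    t.join()
print(results)
```

Neither step names the other; swapping the order in which the two processes are declared would not change the result, which mirrors the way the `align_sample` and `index_sample` declarations can appear in any order.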

SLIDE 26

How parallelisation works

[Figure: a channel carries data x, data y and data z into a process, which runs task 1, task 2 and task 3 and emits out x, out y and out z.]

SLIDE 27

how parallelisation works

    // A single input file: the process runs once
    samples_ch = Channel.fromPath('data/sample.fastq')

    process FASTQC {
        input:
        file reads from samples_ch

        output:
        file 'fastqc_logs' into fastqc_ch

        script:
        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }

SLIDE 28

how parallelisation works

    // A glob pattern: the process runs once per matching file
    samples_ch = Channel.fromPath('data/*.fastq')

    process FASTQC {
        input:
        file reads from samples_ch

        output:
        file 'fastqc_logs' into fastqc_ch

        script:
        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }
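The effect of switching from a single path to a glob pattern can be illustrated with a small Python sketch. It is an analogy rather than Nextflow's mechanism: the `available` file list, the `from_path` helper and the placeholder `fastqc` function (which only builds the command string) are assumptions made for the example. The point is that the task definition stays the same while the number of parallel executions follows the number of matching inputs.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatch

# Hypothetical file listing standing in for the contents of data/.
available = ["data/gut.fastq", "data/lung.fastq", "data/liver.fastq", "data/notes.txt"]


def from_path(pattern, files):
    """Return the files matching a glob pattern, one item per file."""
    return sorted(f for f in files if fnmatch(f, pattern))


def fastqc(path):
    """Placeholder task: build the command that would be run for one input."""
    return f"fastqc -o fastqc_logs -f fastq -q {path}"


samples = from_path("data/*.fastq", available)

# One task per input item, executed concurrently; the task itself is unchanged.
with ThreadPoolExecutor() as pool:
    commands = list(pool.map(fastqc, samples))

print(len(commands))
```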

SLIDE 29

implicit parallelism

[Figure: Channel.fromPath("data/*.fastq") feeding FASTQC; other label in the figure: clustalo]

SLIDE 30

handling file pairs

Input files:

    gut_1.fq  gut_2.fq  liver_1.fq  liver_2.fq  lung_1.fq  lung_2.fq

    Channel.fromFilePairs("*_{1,2}.fq")

Emitted items:

    ( gut, [gut_1.fq, gut_2.fq] )
    ( lung, [lung_1.fq, lung_2.fq] )
    ( liver, [liver_1.fq, liver_2.fq] )
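As a rough illustration of the grouping shown above, the following Python sketch pairs files that share a prefix. It is not how `Channel.fromFilePairs` is implemented; the `from_file_pairs` helper and its regular expression are hypothetical and handle only the `<id>_1.fq` / `<id>_2.fq` naming used on this slide.

```python
import re
from collections import defaultdict

files = ["gut_1.fq", "gut_2.fq", "liver_1.fq", "liver_2.fq", "lung_1.fq", "lung_2.fq"]


def from_file_pairs(names):
    """Group names of the form <id>_1.fq / <id>_2.fq into (id, [files]) tuples."""
    groups = defaultdict(list)
    for name in sorted(names):
        match = re.fullmatch(r"(.+)_[12]\.fq", name)
        if match:
            groups[match.group(1)].append(name)
    # Keep only complete pairs.
    return [(key, pair) for key, pair in groups.items() if len(pair) == 2]


pairs = from_file_pairs(files)
for pair in pairs:
    print(pair)
```

Each tuple carries the shared identifier together with its two files, which is the shape a downstream task needs in order to process both reads of a sample together.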

SLIDE 31

basic example

    process FASTQC {
        input:
        // Each input item is a (pair_id, [read files]) tuple
        set pair_id, file(reads) from samples_ch

        output:
        file 'fastqc_logs' into fastqc_ch

        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }

    ( gut, [gut_1.fq, gut_2.fq] )
    ( lung, [lung_1.fq, lung_2.fq] )
    ( liver, [liver_1.fq, liver_2.fq] )

SLIDE 32

deployment scenarios

SLIDE 33

local execution

  • Common development scenario
  • Dependencies can be managed using a container runtime
  • Parallelisation is managed by spawning POSIX processes
  • Can scale vertically using a fat server / shared-memory machine

[Figure labels: nextflow, OS, local storage, docker/singularity, laptop / workstation]

SLIDE 34

centralised orchestration

  • Nextflow orchestrates workflow execution by submitting jobs to a compute cluster, e.g. SLURM
  • It can run on the head node or a compute node
  • Requires shared storage to exchange data between tasks
  • Ideal for coarse-grained parallelism

[Figure labels: computer cluster, nextflow, submit jobs, cluster nodes, NFS/Lustre]

SLIDE 35

distributed orchestration

  • A single job request allocates the desired compute nodes
  • Nextflow deploys its own embedded compute cluster
  • The main instance orchestrates the workflow execution
  • The worker instances execute workflow jobs (work stealing approach)

[Figure labels: login node, job request, launcher wrapper, HPC cluster, cluster nodes, nextflow cluster, nextflow driver, nextflow workers, NFS/Lustre]

SLIDE 36

AWS Batch deployment

[Figure labels: AWS Cloud, AWS Batch, EC2 Spot Instance, Amazon S3, EC2 Container Registry, Task, Container, Nextflow tasks, Task submission]

    nextflow run -with-batch

SLIDE 37

Kubernetes / OpenShift

  • Next-generation native cloud clustering for containerised workloads
  • There is a need for workflow orchestration
  • K8S executor works well with OpenShift

SLIDE 38

portability

SLIDE 39

portability

    process {
        executor = 'slurm'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }

SLIDE 40

portability

    process {
        executor = 'awsbatch'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }

SLIDE 41

configuration decoupling is the key to portable deployments
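A minimal Python sketch of this decoupling, assuming nothing about how Nextflow resolves its configuration internally: the `PROFILES` dictionary and the `describe_submission` helper are hypothetical, and the settings simply mirror the two configuration blocks on the previous slides. The task script is written once; only the deployment settings chosen at run time differ.

```python
# Hypothetical deployment profiles mirroring the two configurations shown earlier.
PROFILES = {
    "slurm": {
        "executor": "slurm",
        "queue": "my-queue",
        "memory": "8 GB",
        "cpus": 4,
        "container": "user/image",
    },
    "awsbatch": {
        "executor": "awsbatch",
        "queue": "my-queue",
        "memory": "8 GB",
        "cpus": 4,
        "container": "user/image",
    },
}

# The implementation logic: identical for every deployment target.
TASK_SCRIPT = "bwa mem reference.fa sample.fq | samtools sort -o sample.bam"


def describe_submission(script, profile):
    """Combine an unchanged task script with deployment settings chosen at run time."""
    return {"script": script, **profile}


on_cluster = describe_submission(TASK_SCRIPT, PROFILES["slurm"])
on_cloud = describe_submission(TASK_SCRIPT, PROFILES["awsbatch"])

print(on_cluster["executor"], on_cloud["executor"])
```

Moving from the cluster to the cloud changes one value in the profile and nothing in the task itself.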

SLIDE 42

CONTAINERISATION

SLIDE 43

container vs. VM

  • Lighter: MB vs GB
  • Faster startup: ms/secs vs minutes
  • Virtualise a process/application instead of an OS/hardware
  • Immutable: containers don't change over time, thus guaranteeing replicability across executions.
  • Composable: the output of one container is directly consumable as input by another container.
  • Transparent: they are created with a well-defined automated procedure.

SLIDE 44

containerisation

  • Nextflow envisioned the use of software containers to fix computational reproducibility
  • Mar 2014 (ver 0.7), support for Docker
  • Dec 2016 (ver 0.23), support for Singularity

[Figure labels: Nextflow, job, job, job]

SLIDE 45

  • Community effort to collect production-ready analysis pipelines built with Nextflow
  • Initially supported by SciLifeLab, QBiC and A*Star Genome Institute Singapore
  • https://nf-co.re

Alexander Peltzer, Phil Ewels, Andreas Wilm

SLIDE 46

execution reports

SLIDE 47

execution reports

SLIDE 48

execution reports

SLIDE 49

execution timelines

SLIDE 50

dag visualisation

SLIDE 51

code editors + syntax highlighting

SLIDE 52

In production since 2014

SLIDE 53

Nextflow in the OpenRiskNet VE

SLIDE 54

Hello World, Hello OpenRiskNet

  1. SSH into the VE (ssh -i ~/.ssh/openrisknet evan@130.238.28.49)
  2. oc login https://prod.openrisknet.org -u developer
  3. oc project nextflow
  4. nextflow kuberun nextflow-io/hello -v nf-0001

https://github.com/nextflow-io/hello

SLIDE 55

SLIDE 56

RNA-Seq Analysis

https://github.com/nextflow-io/rnaseq-nf

  1. nextflow kuberun nextflow-io/rnaseq-nf -v nf-0002
  2. See the pod with `oc get pod`

SLIDE 57

Hybrid & Bursting Into the Public Cloud

  1. AWS configure / NF Config / Env Variables
  2. nextflow kuberun nextflow-io/rnaseq-nf -r hybrid -v nf-0002
  3. https://github.com/nextflow-io/rnaseq-nf/tree/hybrid
  4. https://cbcrg.signin.aws.amazon.com/console

SLIDE 58

Hybrid & Bursting Into the Public Cloud

[Figure labels: AWS Cloud, AWS Batch, EC2 Spot Instance, Amazon S3, EC2 Container Registry, Task, Container, Nextflow tasks, Task submission]

    nextflow run -with-batch

SLIDE 59

External datasources & data localisation

https://github.com/ewels/AWS-iGenomes

SLIDE 60

Continuing Work

  • Integration with the JupyterHub launcher https://jupyterhub-jupyter.prod.openrisknet.org
  • Consolidate the Toxicogenomics case study at https://github.com/OpenRiskNet/nf-toxomix
  • Expand the case study to include other datasets

SLIDE 61

Report on computational and data federation

  • Out soon on the OpenRiskNet website!

SLIDE 62

Acknowledgements

OpenRiskNet (Grant Agreement 731075) is a project funded by the European Commission within the Horizon 2020 Programme.

Project partners:

  • P1 Douglas Connect GmbH, Switzerland (DC)
  • P2 Johannes Gutenberg-Universität Mainz, Germany (JGU)
  • P3 Fundacio Centre De Regulacio Genomica, Spain (CRG)
  • P4 Universiteit Maastricht, Netherlands (UM)
  • P5 The University Of Birmingham, United Kingdom (UoB)
  • P6 National Technical University Of Athens, Greece (NTUA)
  • P7 Fraunhofer Gesellschaft Zur Foerderung Der Angewandten Forschung E.V., Germany (Fraunhofer)
  • P8 Uppsala Universitet, Sweden (UU)
  • P9 Medizinische Universität Innsbruck, Austria (MUI)
  • P10 Informatics Matters Limited, United Kingdom (IM)
  • P11 Institut National De L’environnement Et Des Risques INERIS, France (INERIS)
  • P12 Vrije Universiteit Amsterdam, Netherlands (VU)
