Using workflow managers to co-ordinate multistep analysis pipelines

Using workflow managers to co-ordinate multistep analysis pipelines across multiple compute nodes in a reproducible manner.

Traditional HPC jobs are single monolithic programs using multi-node parallelism

Today many researchers use notebooks on clusters to do interactive/interpretive analysis of datasets

Research computing spectrum
At one end: single, large, long-running, multi-node jobs, e.g. a climate model
At the other end: single-core, quick-running, interpretive analysis jobs, e.g. regression analysis of a (quite) big dataset

[Figure: repeated clean_data.py and my_analysis.r steps feeding a single combined_result]

Carrying out multi-step analyses by hand

Reproducibility
results = code(data)
• Typing at a terminal is BAD NEWS for reproducibility
• Notebooks (for low intensity work)
• Containers
• Neither works very easily with multi-node parallelism
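If results = code(data), then recording exactly which code and which data went into a run is enough to identify the result. The sketch below illustrates that idea with a hash over both inputs. The `fingerprint` helper is hypothetical and is not part of any workflow manager mentioned in these slides.

```python
import hashlib


def fingerprint(code: bytes, data: bytes) -> str:
    """Return a SHA-256 digest identifying one (code, data) pair."""
    digest = hashlib.sha256()
    # Hash each part separately first, so the boundary between code and
    # data cannot be shifted without changing the final digest.
    digest.update(hashlib.sha256(code).digest())
    digest.update(hashlib.sha256(data).digest())
    return digest.hexdigest()


if __name__ == "__main__":
    script = b"print(sum(values))"
    dataset = b"1,2,3"
    # Same code and same data always give the same fingerprint.
    print(fingerprint(script, dataset) == fingerprint(script, dataset))
    # Changing either input gives a different fingerprint.
    print(fingerprint(script, dataset) == fingerprint(script, b"1,2,4"))
```

Storing such a fingerprint next to each result makes it possible to check later whether a result still corresponds to the current code and data.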

1. Easy/Automatic
2. Reproducible
3. Generalizable/Scalable

Workflow manager
• Specify dependencies between tasks
• Check which dependencies need updating
• Only run tasks that need updating
• Do all this unsupervised
[Figure: chain of files, Step1.gz to Step2.dat to Step3.txt]
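The "only run tasks that need updating" check can be sketched with file modification times, in the style of make. This is a minimal illustration, not how any particular workflow manager does it. The function names and the `concatenate` task are hypothetical.

```python
import os
import tempfile


def needs_update(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def run_if_stale(target, dependencies, action):
    """Run action(dependencies, target) only when target is out of date."""
    if needs_update(target, dependencies):
        action(dependencies, target)
        return True
    return False


def concatenate(inputs, output):
    """Hypothetical task: join the input files into one output file."""
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                out.write(handle.read())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "step1.txt")
        target = os.path.join(workdir, "step2.txt")
        with open(source, "w") as handle:
            handle.write("raw data\n")
        # Target is missing, so the task runs.
        print(run_if_stale(target, [source], concatenate))
        # Target is now newer than its dependency, so the task is skipped.
        print(run_if_stale(target, [source], concatenate))
```

Timestamps are the simplest possible criterion. Content hashes or recorded parameters are alternative ways of deciding that a task is out of date.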

Modern workflow managers
• Come as DSLs, configuration-based tools or libraries
• Allow more complex forms of dependency
• Automatically submit each job to the cluster
• Monitor for successful completion and automatically submit the next job
• Parameterizable
• Extensive logging
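The "monitor for successful completion and automatically submit the next job" loop can be sketched with the Python standard library (`graphlib`, Python 3.9+). This is an illustration under simplifying assumptions. Tasks are plain Python callables run in-process rather than cluster jobs, and `run_pipeline` is a hypothetical name.

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def run_pipeline(tasks, dependencies):
    """Run every task after the tasks it depends on.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of prerequisite task names
    Returns the order in which tasks completed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    completed = []
    while sorter.is_active():
        for name in sorter.get_ready():
            log.info("starting %s", name)
            # A real manager would submit this to the cluster and wait;
            # an exception here stops the run before downstream tasks start.
            tasks[name]()
            log.info("finished %s", name)
            completed.append(name)
            sorter.done(name)  # unlocks tasks that were waiting on this one
    return completed


if __name__ == "__main__":
    steps = {
        "step1": lambda: None,
        "step2": lambda: None,
        "step3": lambda: None,
    }
    order = {"step1": set(), "step2": {"step1"}, "step3": {"step2"}}
    print(run_pipeline(steps, order))
```

Every task returned together by `get_ready()` is independent of the others in that batch, which is the point at which a real manager could submit jobs in parallel.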

Modern workflow managers
May also provide:
• conda / singularity / docker integration
• Use of cloud compute and/or storage as well as the local cluster
• (Distributed) execution of arbitrary code as well as shell scripts
• Helper functions for common analysis tasks

Some modern WFM
[Figure: workflow managers, Snakemake among them, positioned between "Ease of development" and "Flexibility, scalability, portability, customisation, performance"]

Feature | CGAT-core | Snakemake | Nextflow
Language | Python | DSL | DSL
Dependency paradigm | Explicit | Implicit (pull) | Implicit (push)
Rich dependency graphs | Yes | Partial | Yes
Conda integration | Yes | Yes | Yes
Singularity/docker | Coming soon | Yes | Yes
Arbitrary code | Python | Python | Any interpreted
Cloud execution | No | Kubernetes | Amazon Batch
Cloud storage | Google/S3 | Many | Many
Functions for common analysis | Yes | No | No

Demonstration

It should take less time, effort and thought to do it the right way than to do it the wrong way

Gene profiles
[Figure: a long sequence (GAGAGCGCGAGATAGAGACAGTGAGACTATCATAGAG...) alongside millions of lines of short sequences such as GAGACTATCATAGAG, TAGAGAGCGCGAGAT and TATCATATATCATAG]

Ruffus dependency types
• Originate: None to one
• Transform: One to one
• Split: One to many
• Merge: Many to one
• Collate: Many to fewer
• Subdivide: Many to more
• Follows: Dependency without common files
• Files: Arbitrary relationship
• Combinatorics: Permutations, Product, Combinations, Combinations_with_replacement
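The cardinalities of three of these dependency types can be illustrated in plain Python by mapping input file names to output file names. This is not the Ruffus API. The functions below only borrow the names to show the one-to-one, many-to-one and many-to-fewer shapes, and the file names are made up.

```python
import re
from collections import defaultdict


def transform(inputs, old_suffix, new_suffix):
    """One to one: each input file maps to exactly one output file."""
    return {name: name[: -len(old_suffix)] + new_suffix
            for name in inputs if name.endswith(old_suffix)}


def merge(inputs, output):
    """Many to one: all inputs feed a single output."""
    return {output: list(inputs)}


def collate(inputs, pattern, output_template):
    """Many to fewer: inputs sharing a regex match feed the same output."""
    groups = defaultdict(list)
    for name in inputs:
        match = re.match(pattern, name)
        if match:
            # Inputs whose captured group is equal end up under one output.
            groups[match.expand(output_template)].append(name)
    return dict(groups)


if __name__ == "__main__":
    bams = ["liver_rep1.bam", "liver_rep2.bam", "brain_rep1.bam"]
    print(transform(["a.fastq", "b.fastq"], ".fastq", ".bam"))
    print(merge(bams, "all.bam"))
    print(collate(bams, r"(\w+?)_rep\d+\.bam", r"\1.merged.bam"))
```

In the `collate` call, the two liver replicates share one output while the brain replicate gets its own, so three inputs become two outputs.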

Pipelines can get quite complex… pipeline_mapping

Really very complicated!

Summary
• Automated farming and monitoring of pipelines of jobs to the cluster
• Create fully logged and reproducible workflows
• Generalizable and scalable
• Should be easier than writing an SGE submission script and faster than running in an interactive session
• Install with conda install -c bioconda -c conda-forge cgatcore

Acknowledgements
Sudbery Lab for Computational Genomics @ TUOS: Dr. Cristina Alexandru-Crivac, Jaime Alvarez-Benayas, Justin Coyne, Magdelena Dabrowska, Sumeet Deshmurkh, Jacob Parker, Ivaylo Yonchev
MRC Computational Genomics Analysis and Training/Tools: Dr. Adam Cribbs, Sebastian Luna-Valero, Dr. Charlotte George, Dr. Antonio Berlanga-Taylor, Dr. Stephen Sansom, Dr. Tom Smith, Dr. Nicholas Ilott, Dr. Jethro Johnson, Jakub Scaber, Dr. Katherine Brown, Dr. David Sims, Dr. Andreas Heger, Dr. Leo Goodstadt (Ruffus)

https://cgat-core.readthedocs.io Cribbs AP, et al. F1000Research 2019, 8:377
https://snakemake.readthedocs.io Köster J, Rahmann S. Bioinformatics 2012, 28:2520
https://nextflow.io Di Tommaso P, et al. Nature Biotechnology 2017, 35:316
