Pipelines and Workflows Adapting to HPC Funding Partners bioexcel.eu

Reusing this material
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_US
This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: You must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original.
Note that this presentation contains images owned by others. Please seek their permission before reusing these images.

Outline
• Bioinformatics pipelines and HPC
• What’s the problem?
• MapReduce - another parallel pattern
• Hadoop – a MapReduce engine
• Halvade: distributing pipelines
• Common Workflow Languages and Pipeline Frameworks

Bioinformatics pipelines and HPC: what’s the problem?
• Many bioinformatics tools are parallelised using threading only
  • Best suited to shared-memory machines with a large amount of memory and modest numbers of cores (compared to HPC)
  • On distributed-memory systems, restricted to a single node
• Indications are that many multithreaded tools used in pipelines don’t scale well to the full number of cores typical of a single node (low parallel efficiency)
• Few tools use MPI
  • Not the only way to scale, but an important / powerful one that would give instant ability to run on HPC machines
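
The "low parallel efficiency" point above can be made concrete with the usual definitions: speedup S(p) = T(1) / T(p) and efficiency E(p) = S(p) / p. The following is a minimal sketch; the timings are made-up illustrative numbers, not measurements of any real tool.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S(p) = T(1) / T(p)."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Efficiency E(p) = S(p) / p; 1.0 would be perfect scaling."""
    return speedup(t_serial, t_parallel) / cores


# Hypothetical wall-clock times in seconds for one threaded tool (not measured data).
timings = {1: 1000.0, 2: 520.0, 4: 290.0, 8: 200.0, 16: 170.0}

for cores, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = parallel_efficiency(timings[1], seconds, cores)
    print(f"{cores:2d} cores: speedup {s:5.2f}, efficiency {e:4.0%}")
```

With numbers like these, adding cores still shortens the run, but each extra core contributes less and less, which is the pattern the slide describes.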

Bioinformatics pipelines and HPC: what’s the problem?
Example:
user@machine:~> bwa mem -t $NSLOTS -M $BWA_INDEX_REF \
      -R "@RG\tID:$PU\tPL:illumina\tPU:$PU\tSM:$SAMPLE" $READS1 $READS2 \
  | samblaster \
      --splitterFile >(samtools view -hSu /dev/stdin | samtools sort -@ $NSLOTS /dev/stdin > $SAMPLE.sr.bam) \
      --discordantFile >(samtools view -hSu /dev/stdin | samtools sort -@ $NSLOTS /dev/stdin > $SAMPLE.disc.bam) \
  | samtools view -hSu /dev/stdin \
  | samtools sort -@ $NSLOTS /dev/stdin > $SAMPLE.raw.bam

Bioinformatics pipelines and HPC: what’s the problem?
Example simplified:
A | B --arg1 >( C | D > file1) --arg2 >( E | F > file2) | G | H > file3
• 8 executables (A – H)
• 4 multithreaded
  • potentially using different threading standards (pthreads, OpenMP, …)
• Efficiency of parallel executables unknown
• No idea of optimal number of threads to assign to each (assuming we can)
• Linux pipes don’t allow straightforward control over parallel execution
  • Mostly relying on operating system to do the right thing
• Data flow through pipes and pipe buffers adds additional complications

Bioinformatics pipelines and HPC: what’s the problem?
• Can take days to run
• Difficult to determine which stages are taking the most time
  • Can’t use many standard performance profiling tools
• Difficult to understand, let alone improve, runtime
  • Need to carefully tease apart performance of each step
  • Still leaves questions regarding dataflow between piped commands
    • Linux piping may be very efficient in some circumstances
    • but we don’t have a good handle on this
• Risk is that this approach became common because it was available, convenient (interactive), and sufficiently fast for smaller data sets
  • Possibly running into problems with more and more data
  • Issues surrounding robustness, checkpointing (can’t run > 48 hours on ARCHER)
• Need a solution engineered for purpose to tackle larger scale in a controlled manner
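
One possible way to start teasing apart per-stage cost is to launch the piped commands from a small driver instead of the shell, and ask the operating system for each child's CPU time when it exits. The sketch below is an illustration under stated assumptions, not a recommended tool: it is Unix-only, needs Python 3.9+ (for `os.waitstatus_to_exitcode`), and the three stages are hypothetical stand-ins written as Python one-liners rather than real bioinformatics executables. It reports CPU time only; time spent blocked on a full or empty pipe is not captured.

```python
import os
import subprocess
import sys


def run_pipeline(stages):
    """Run stages as a pipe chain; return (final_output, [(exit_code, cpu_seconds), ...])."""
    procs = []
    upstream = None
    for argv in stages:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            upstream.close()  # parent drops its copy so EOF propagates between children
        upstream = proc.stdout
        procs.append(proc)

    output = upstream.read()  # drain the last stage's stdout
    upstream.close()

    report = []
    for proc in procs:
        # os.wait4 reaps the child and returns its resource usage.
        _, status, usage = os.wait4(proc.pid, 0)
        proc.returncode = os.waitstatus_to_exitcode(status)
        report.append((proc.returncode, usage.ru_utime + usage.ru_stime))
    return output, report


# Hypothetical stand-in stages: generate numbers, filter and square, count lines.
STAGES = [
    [sys.executable, "-c", "for i in range(200000): print(i)"],
    [sys.executable, "-c",
     "import sys\nfor line in sys.stdin:\n    n = int(line)\n    if n % 3 == 0: print(n * n)"],
    [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"],
]

output, report = run_pipeline(STAGES)
for argv, (code, cpu) in zip(STAGES, report):
    print(f"exit={code} cpu={cpu:.3f}s  {argv[2][:40]!r}")
```

Comparing the summed CPU seconds with the overall wall-clock time gives a first hint of which stage dominates and how much of the run is spent waiting on data flow.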

MapReduce – another parallel pattern
• Like Task Farm, but three categories of worker:
• Mapper (user supplies this code)
  • Takes a (local) list of key-value pairs, and for each pair, returns another (intermediate) key-value pair
  • Writes these out locally
• Grouper (part of the run-time), can be done by the master
  • Groups by (intermediate) key on local disk
• Reducer (user supplies this code)
  • One reducer for each (intermediate) key
  • Takes the (intermediate) key-value pairs from all relevant disks, performs a reduction operation and returns another (usually) shorter list of (final) key-value pairs
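
The three roles above can be sketched in a single process. This is only an illustration of the pattern, not a distributed implementation: the input partitions stand in for data held locally by different workers, and the read IDs and chromosome names are hypothetical. The example counts reads per chromosome.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def mapper(read_id: str, chromosome: str) -> Pair:
    """User code: turn one input pair into one intermediate pair."""
    return (chromosome, 1)


def group(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Run-time code: collect all values that share an intermediate key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reducer(key: str, values: List[int]) -> Pair:
    """User code: reduce all values for one key to a single final pair."""
    return (key, sum(values))


def map_reduce(partitions: List[List[Tuple[str, str]]]) -> Dict[str, int]:
    intermediate: List[Pair] = []
    for partition in partitions:  # each partition stands in for one worker's local data
        intermediate.extend(mapper(k, v) for k, v in partition)
    grouped = group(intermediate)
    return dict(reducer(k, vs) for k, vs in sorted(grouped.items()))


# Hypothetical input: (read ID, chromosome it aligned to), split over two workers.
partitions = [
    [("read1", "chr1"), ("read2", "chr2"), ("read3", "chr1")],
    [("read4", "chrX"), ("read5", "chr1"), ("read6", "chr2")],
]

print(map_reduce(partitions))  # {'chr1': 3, 'chr2': 2, 'chrX': 1}
```

In a real engine the mapper and reducer calls run on different nodes and the grouping step moves data between them; only the two user-supplied functions would stay the same.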

Hadoop
• Hadoop is an implementation of the MapReduce pattern
• Engine to manage execution of many data processing tasks
• Includes a distributed filesystem
• Used as an engine to distribute data to where it’s needed, including to perform compute and gather results

Halvade
• Example framework to run sequencing pipelines in parallel
  • based on Hadoop
  • multicore AND multi-node
  • alignment (BWA) → data preparation (Picard) → variant calling (GATK)
• Developers analysed
  • single-node threaded efficiency of each tool
    • used to allocate optimum number of threads for each tool in Halvade
  • Optimal task size (experimentally) - found granularity sweet spot
• Good parallel efficiency (~90%) running whole human genome on distributed-memory cluster (up to 16 nodes - 512 cores, 60GB memory per node)
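
As an illustration of how single-node efficiency measurements might feed into thread allocation, the sketch below picks, for each tool, the largest thread count that still meets an efficiency threshold, and then works out how many instances fit on a node. This is one plausible rule, not Halvade's actual procedure; the tool names, timings, threshold and the 32-core node are all hypothetical.

```python
def best_thread_count(timings, min_efficiency=0.8):
    """Largest thread count whose efficiency T(1) / (p * T(p)) meets the threshold."""
    t1 = timings[1]
    best = 1
    for threads in sorted(timings):
        efficiency = t1 / (threads * timings[threads])
        if efficiency >= min_efficiency:
            best = threads
    return best


def instances_per_node(cores_per_node, threads):
    """How many instances of a tool fit on one node at the chosen thread count."""
    return max(1, cores_per_node // threads)


# Hypothetical single-node timings in seconds (threads -> wall-clock time).
measured = {
    "aligner": {1: 800.0, 2: 410.0, 4: 215.0, 8: 120.0, 16: 90.0},
    "variant_caller": {1: 600.0, 2: 330.0, 4: 240.0, 8: 210.0},
}

CORES_PER_NODE = 32  # assumed node size for this example

for tool, timings in measured.items():
    threads = best_thread_count(timings)
    count = instances_per_node(CORES_PER_NODE, threads)
    print(f"{tool}: {threads} threads per instance, {count} instances per node")
```

The design choice here is to run several modestly threaded instances side by side rather than one instance across all cores, so that poorly scaling tools do not leave cores idle.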

Barriers
• Parallel pipeline performance of Halvade and similar frameworks that rely on Hadoop (or Spark) seems promising
• Barrier to deployment, especially on the largest HPC systems, is conflict with existing file systems, resource management, etc.
  • Potential disruption of environment finely tuned for performance of tightly-coupled traditional HPC codes?
  • Not enough demand or incentive?
  • Gap between (potential) users and service providers?
  • HPC hardware and service mainly funded by non-bioinformaticians?
  • … ?
• Smaller HPC machines more flexible
  • EPCC HPC service Cirrus used for genomics analyses, has Hadoop, Spark

Common Workflow Languages and Pipeline Frameworks
• Currently observe a diversity of CWLs and pipeline frameworks
  • Varying approaches to parallelism
  • Future prospects of any given framework uncertain
  • Hampers adoption / traction, decreases incentive for continued development
• Similarities with earlier decades of parallel computing:
  • Early diversity of message-passing libraries
  • Later adoption of a common standard
  • Allowed applications to be written in parallel once and run everywhere
  • Single standard for parallelism enabled targeting of efforts to improve performance from software and hardware sides
  • Great success - led to portable applications that can run on many different and constantly newer HPC machines

Common Workflow Languages and Pipeline Frameworks
• CWL and pipeline frameworks and their users might benefit from something similar
  • Single common framework could potentially improve uptake, deployment and usage on HPC machines, similar to historical emergence of MPI
  • Software development efforts could focus on parallelisation of CWL execution layer(s)
  • Would allow separation of concerns, with relevant expertise directed at each level of software and hardware

Summary
• Some bioinformatics pipelines / workflows are problematic from the point of view of HPC
• Understanding and improving performance is challenging
• Frameworks for massively parallel data-centric computing appear promising
• There are some barriers to uptake / deployment on HPC systems
• Situation is made more complicated by diversity of approaches without one unambiguous favourite
