Snakemake Johannes Köster Genome Informatics, Institute of Human - PowerPoint PPT Presentation

Snakemake

Johannes Köster
Genome Informatics, Institute of Human Genetics, Faculty of Medicine, University Duisburg-Essen
April 10, 2014

Structure

1 Motivation
2 Basic Idea
3 Advanced Features


Motivation

What we liked about GNU Make:
• text based
• rule paradigm
• lightweight

And what not:
• cryptic syntax
• limited scripting
• multiple output files
• scalability

Snakemake

• hook into python interpreter
• pythonic syntax for rule definition
• full python scripting
• scalability
• workflow specific functionality beyond Make basics
• stable community

Syntax

    SAMPLES = "500 501 502 503".split()

    # require a bam for each sample
    rule all:
        input: expand("{sample}.bam", sample=SAMPLES)

    # map reads
    rule map:
        input: "reference.bwt", "{sample}.fastq"
        output: "{sample}.bam"
        threads: 8
        shell:
            "bwa mem -t {threads} {input} | "  # refer to threads and input files
            "samtools view -Sbh - > {output}"  # refer to output files

    # create an index
    rule index:
        input: "reference.fasta"
        output: "reference.bwt"
        shell: "bwa index {input}"
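The expand() call in rule all turns one pattern into a list of concrete target files. As a rough illustration only, the idea can be sketched in plain Python. The function name expand_sketch is hypothetical and this is not Snakemake's own implementation; it assumes every wildcard value is a list of strings.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    # Each combination becomes one concrete file name.
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


SAMPLES = "500 501 502 503".split()
targets = expand_sketch("{sample}.bam", sample=SAMPLES)
print(targets)  # ['500.bam', '501.bam', '502.bam', '503.bam']
```

Each resulting file name is then matched against the output pattern of rule map, which is how the wildcard sample gets its value for every job.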

Basic Usage

    # perform a dry-run
    $ snakemake -n

    # execute the workflow using 8 cores
    $ snakemake -j 8

    # execute the workflow on a cluster (with up to 20 jobs)
    $ snakemake -j 20 --cluster "qsub -pe threaded {threads}"

Visualization

    # visualize the DAG of jobs
    $ snakemake --dag | dot | display

[Figure: DAG of jobs with the nodes index, four map jobs (sample: 500, sample: 501, sample: 502, sample: 503), and all.]
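To make the pipe into dot more concrete, here is a minimal sketch of how a job graph could be written as Graphviz DOT text. The edge list (index feeding each map job, each map job feeding all) is inferred from the rules of the example workflow, the helper jobs_to_dot is hypothetical, and the text it produces is not claimed to match the exact output of snakemake --dag.

```python
def jobs_to_dot(edges):
    """Render (upstream, downstream) job pairs as a Graphviz DOT digraph."""
    lines = ["digraph jobs {"]
    for upstream, downstream in edges:
        lines.append(f'    "{upstream}" -> "{downstream}";')
    lines.append("}")
    return "\n".join(lines)


SAMPLES = "500 501 502 503".split()
edges = []
for sample in SAMPLES:
    # The literal \n is kept in the text so that dot renders a two-line label.
    job = f"map\\nsample: {sample}"
    edges.append(("index", job))
    edges.append((job, "all"))

dot_text = jobs_to_dot(edges)
print(dot_text)
```

Saving the printed text to a file and running dot -Tpng on it should give a picture similar to the one on the slide.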

Advanced Syntax

    SAMPLES = "500 501 502 503".split()

    rule all:
        input: expand("{sample}.bam", sample=SAMPLES)

    # map reads with peanut
    rule map:
        input: "reference.hdf5", "{sample}.fastq"
        output: "{sample}.bam"
        threads: 8
        resources: gpu=1  # define an additional resource
        version: shell("peanut --version")
        shell:
            "peanut map -t {threads} {input} | "
            "samtools view -Sbh - > {output}"

    # create an index with peanut
    rule index:
        input: "reference.fasta"
        output: "reference.hdf5"
        shell: "peanut index {input} {output}"

Scheduling

Maximize the number of running jobs with respect to
• priority
• number of descendants
• input size

while not exceeding
• provided cores
• provided resources

A multi-dimensional knapsack problem.
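The selection problem above can be illustrated with a small brute-force sketch. Everything here is an assumption made for illustration: the Job fields, the single extra resource gpu, and in particular the lexicographic order of the criteria (job count, then priority, then descendants, then input size). Snakemake's real scheduler is not reproduced here, and an exhaustive search like this only works for a handful of jobs.

```python
from itertools import combinations
from typing import NamedTuple


class Job(NamedTuple):
    name: str
    threads: int
    gpu: int
    priority: int
    descendants: int
    input_size: int


def select_jobs(ready, cores, gpus):
    """Exhaustively pick the best subset of ready jobs that fits the limits."""
    best, best_score = (), (0, 0, 0, 0)
    for size in range(1, len(ready) + 1):
        for subset in combinations(ready, size):
            # Skip subsets that exceed the provided cores or resources.
            if sum(job.threads for job in subset) > cores:
                continue
            if sum(job.gpu for job in subset) > gpus:
                continue
            # Assumed ranking: more jobs first, then the slide's three criteria.
            score = (
                len(subset),
                sum(job.priority for job in subset),
                sum(job.descendants for job in subset),
                sum(job.input_size for job in subset),
            )
            if score > best_score:
                best, best_score = subset, score
    return list(best)


ready = [
    Job("map_500", threads=8, gpu=1, priority=0, descendants=1, input_size=40),
    Job("map_501", threads=8, gpu=1, priority=0, descendants=1, input_size=10),
    Job("map_502", threads=8, gpu=1, priority=5, descendants=1, input_size=20),
]
chosen = select_jobs(ready, cores=16, gpus=2)
print([job.name for job in chosen])  # ['map_500', 'map_502']
```

With 16 cores and 2 GPUs only two of the three jobs fit at once; the pair containing the high-priority job with the larger total input size wins.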

Sub-Workflows

    SAMPLES = "500 501 502 503".split()

    # define subworkflow
    subworkflow mapping:
        workdir: "../mapping"

    rule all:
        input: expand("{sample}/results.xprs", sample=SAMPLES)

    # estimate transcript expressions
    rule express:
        input: REF, mapping("{sample}.bam")  # refer to output of subworkflow
        output: "{sample}/results.xprs"
        shell: "express {input} -o {wildcards.sample}"

HTML5 Reports

    from snakemake.utils import report

    rule report:
        input: T1="results.csv", F1="plot.pdf"
        output: html="report.html"
        run:
            report("""
            ==========
            Some Title
            ==========

            See table T1, display some math

            .. math::

                |cq_0 - cq_1| > {MDIFF}
            """, output.html, **input)

Data Provenance

Summarize output file status

    $ snakemake --summary
    file     date                      rule  version  status                  plan
    500.bam  Thu Apr 10 10:55:17 2014  map   1.0      ok                      no update
    501.bam  Thu Apr 10 10:55:17 2014  map   1.0      ok                      no update
    502.bam  Thu Apr 10 10:55:17 2014  map   1.0      updated input files     update pending
    503.bam  Thu Apr 10 10:55:17 2014  map   0.9      version changed to 1.0  no update

Trigger updates:

    # update files with changed versions
    $ snakemake -R `snakemake --list-version-changes`

    # update files with changed code
    $ snakemake -R `snakemake --list-code-changes`
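The "updated input files" status in the summary follows the Make tradition of comparing file timestamps. The sketch below mimics only that one case and assumes plain modification-time comparison. The function plan_for and the file names are hypothetical, and the version and code tracking shown in the summary are not modelled.

```python
import os
import tempfile


def plan_for(output_path, input_paths):
    """Return a plan string by comparing modification times."""
    if not os.path.exists(output_path):
        return "update pending"
    output_mtime = os.path.getmtime(output_path)
    # Any input newer than the output means the output is stale.
    if any(os.path.getmtime(path) > output_mtime for path in input_paths):
        return "update pending"
    return "no update"


with tempfile.TemporaryDirectory() as workdir:
    fastq = os.path.join(workdir, "502.fastq")
    bam = os.path.join(workdir, "502.bam")
    for path in (fastq, bam):
        with open(path, "w") as handle:
            handle.write("placeholder\n")

    # Input older than output: nothing to do.
    os.utime(fastq, (1000, 1000))
    os.utime(bam, (2000, 2000))
    fresh_plan = plan_for(bam, [fastq])

    # Input touched after the output was written: the output must be rebuilt.
    os.utime(fastq, (3000, 3000))
    stale_plan = plan_for(bam, [fastq])

print(fresh_plan)  # no update
print(stale_plan)  # update pending
```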

Conclusion

Snakemake is a Make-like workflow system providing
• a readable syntax
• sophisticated scripting with python
• scalability from single-core to cluster
• support for hybrid computing
• data provenance
• modularization capabilities

Roadmap:
• DRMAA support
• a workflow or rule library

http://bitbucket.org/johanneskoester/snakemake

Köster, J., Rahmann, S., Snakemake – a scalable bioinformatics workflow engine. Bioinformatics 2012.
