
Genome Informatics

Building and Documenting Bioinformatics Workflows with Python-based Snakemake

Johannes Köster, Sven Rahmann
German Conference on Bioinformatics, September 2012

Structure

1. Motivation
2. Snakemake Language
3. Snakemake Engine
4. Conclusion

Motivation

[Figure: diagram of a typical analysis. Recoverable labels: data (sequence reads, proteomics, protein networks, ...); tools / scripts (bwa, samtools, gatk, ...); results (tables, plots, ...); recurring tasks (new samples, adjust parameters, document).]

Workflow Descriptions

IDIR=../include
ODIR=obj
LDIR=../lib
LIBS=-lm
CC=gcc
CFLAGS=-I$(IDIR)

_HEADERS = hello.h
HEADERS = $(patsubst %,$(IDIR)/%,$(_HEADERS))

_OBJS = hello.o hellofunc.o
OBJS = $(patsubst %,$(ODIR)/%,$(_OBJS))

# build the executable from the object files
hello: $(OBJS)
	$(CC) -o $@ $^ $(CFLAGS)

# compile a single .c file to an .o file
$(ODIR)/%.o: %.c $(HEADERS)
	$(CC) -c -o $@ $< $(CFLAGS)

# clean up temporary files
.PHONY: clean
clean:
	rm -f $(ODIR)/*.o *~ core $(IDIR)/*~

Sources:
http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor
http://www.taverna.org.uk


Why Snakemake?

GNU Make provided us with...
- a language to write rules to create each output file from input files
- wildcards for generalization
- implicit dependency resolution
- implicit parallelization
- fast and collaborative development on text files

but we missed...
- easy-to-read syntax
- simple scripting inside the workflow
- creating more than one output file with a rule
- multiple wildcards in filenames
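To make "multiple wildcards in filenames" concrete, here is a minimal Python sketch of matching a filename against a pattern with several named wildcards. The function name `match_wildcards` and the pattern `{dataset}/{sample}.bam` are made up for illustration; this is one plausible approach, not Snakemake's actual implementation.

```python
import re


def match_wildcards(pattern, filename):
    """Return {wildcard: value} if filename fits pattern, else None."""
    regex = ""
    seen = set()
    # re.split with a capture group alternates literal text and wildcard names.
    for i, part in enumerate(re.split(r"\{(\w+)\}", pattern)):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text
        elif part in seen:
            regex += f"(?P={part})"       # repeated wildcard must match the same value
        else:
            seen.add(part)
            regex += f"(?P<{part}>.+)"    # first occurrence captures a value
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None


print(match_wildcards("{dataset}/{sample}.bam", "run1/500.bam"))
print(match_wildcards("{sample}.sai", "500.bam"))
```

The first call binds both wildcards at once; the second returns None because the suffix does not fit. A plain `%` pattern in GNU Make can only stand for a single stem per rule.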

Snakemake Language

Idea: extend the Python syntax, but avoid writing a full parser.

Snakefile -> Python tokenizer -> Token Automaton -> Python Interpreter

Token Automaton:
- input: Snakefile tokens
- emission: Python tokens
- transition: prefix-free grammar

A rule as written in the Snakefile:

rule map_reads:
    input: "hg19.fasta", "{sample}.fastq"
    output: "{sample}.sai"
    shell: "bwa aln {input} > {output}"

The automaton rewrites it one directive at a time: `rule map_reads:` becomes `@rule("map_reads")`, the `input:` and `output:` directives become `@input(...)` and `@output(...)`, and the `shell:` directive becomes a function definition whose body calls `shell(...)`. The result is plain Python:

@rule("map_reads")
@input("hg19.fasta", "{sample}.fastq")
@output("{sample}.sai")
def __map_reads(input, output, wildcards):
    shell("bwa aln {input} > {output}")
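The emitted Python only runs if `rule`, `input`, `output` and `shell` are defined somewhere. The slides do not show those definitions, so the following is a hypothetical minimal sketch of what they could look like. The names `RULES`, `COMMANDS` and `Files` are assumptions, and `shell` here only records the command (a dry run) instead of executing it.

```python
import inspect

RULES = {}      # rule name -> function carrying .input / .output patterns
COMMANDS = []   # commands collected by the dry-run shell() below


class Files(list):
    """A list of paths that formats as a space-separated string."""

    def __str__(self):
        return " ".join(self)


def rule(name):
    def decorate(func):
        RULES[name] = func  # outermost decorator: register the finished rule
        return func
    return decorate


def input(*patterns):  # shadows the builtin, as the code on the slide does
    def decorate(func):
        func.input = list(patterns)
        return func
    return decorate


def output(*patterns):
    def decorate(func):
        func.output = list(patterns)
        return func
    return decorate


def shell(cmd):
    # Fill {input} and {output} from the caller's local variables and
    # record the command instead of running it (assumption: dry run only).
    caller_locals = inspect.currentframe().f_back.f_locals
    COMMANDS.append(cmd.format(**caller_locals))


@rule("map_reads")
@input("hg19.fasta", "{sample}.fastq")
@output("{sample}.sai")
def __map_reads(input, output, wildcards):
    shell("bwa aln {input} > {output}")


# Run the rule body for sample 500 with concrete file names.
RULES["map_reads"](Files(["hg19.fasta", "500.fastq"]), Files(["500.sai"]), {"sample": "500"})
print(COMMANDS[0])
```

Because decorators are applied bottom-up, `@rule` sees a function that already carries its input and output patterns, which is why registration can happen in the outermost decorator.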

Example Workflow

For samples {500, ..., 503}, map reads to hg19.

SAMPLES = "500 501 502 503".split()

rule all:
    input: expand("{sample}.bam", sample=SAMPLES)

rule sai_to_bam:
    input: "hg19.fasta", "{sample}.sai", "{sample}.fastq"
    output: protected("{sample}.bam")
    shell: "bwa samse {input} | samtools view -Sbh - > {output}"

rule map_reads:
    input: "hg19.fasta", "{sample}.fastq"
    output: temp("{sample}.sai")
    shell: "bwa aln {input} > {output}"
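The `expand(...)` call in `rule all` turns one pattern into the list of concrete target files. A rough Python sketch of that behaviour follows. The name `expand_sketch` is hypothetical, and the sketch assumes every wildcard value is given as a list; it is not Snakemake's own code.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(wildcards[name] for name in names))
    ]


SAMPLES = "500 501 502 503".split()
targets = expand_sketch("{sample}.bam", sample=SAMPLES)
print(targets)
```

With the four samples above, `targets` holds the four `.bam` files that `rule all` requests as input.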

Example Workflow: dependency graph

[Figure: job graph for samples 500 to 503. rule all requires 500.bam, 501.bam, 502.bam, 503.bam. Each .bam file is created by one rule sai_to_bam job from the matching .sai file (500.sai, ..., 503.sai), and each .sai file is created by one rule map_reads job from the matching .fastq file (500.fastq, ..., 503.fastq).]
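The job graph for this example can be derived by working backwards from the requested .bam files. The Python sketch below shows one way to do that, assuming a simplified rule table with one output pattern per rule. The names `RULES`, `match` and `plan` are hypothetical, and a real engine would also check file timestamps and schedule independent jobs in parallel.

```python
import re

# Simplified copy of the two rules from the example workflow.
RULES = [
    {"name": "sai_to_bam",
     "input": ["hg19.fasta", "{sample}.sai", "{sample}.fastq"],
     "output": "{sample}.bam"},
    {"name": "map_reads",
     "input": ["hg19.fasta", "{sample}.fastq"],
     "output": "{sample}.sai"},
]


def match(pattern, filename):
    """Return wildcard values if filename fits pattern, else None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None


def plan(target, existing, jobs):
    """Append the jobs needed to create target to jobs, dependencies first."""
    if target in existing:
        return
    for rule in RULES:
        wildcards = match(rule["output"], target)
        if wildcards is None:
            continue
        # Resolve the inputs of this job before the job itself.
        for path in (p.format(**wildcards) for p in rule["input"]):
            plan(path, existing, jobs)
        job = (rule["name"], target)
        if job not in jobs:
            jobs.append(job)
        return
    raise ValueError(f"no rule produces {target}")


SAMPLES = "500 501 502 503".split()
existing = {"hg19.fasta"} | {f"{s}.fastq" for s in SAMPLES}
jobs = []
for sample in SAMPLES:
    plan(f"{sample}.bam", existing, jobs)

for name, target in jobs:
    print(name, "->", target)
```

The resulting list contains eight jobs, with each `map_reads` job placed before the `sai_to_bam` job that consumes its output.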
