Building and Documenting Bioinformatics Workflows with Python-based - - PowerPoint PPT Presentation

building and documenting bioinformatics workflows with
SMART_READER_LITE
LIVE PREVIEW

Building and Documenting Bioinformatics Workflows with Python-based - - PowerPoint PPT Presentation

Genome Informatics Building and Documenting Bioinformatics Workflows with Python-based Snakemake Johannes K oster, Sven Rahmann German Conference on Bioinformatics September 2012 1 / 13 Genome Structure Informatics 1 Motivation 2


slide-1
SLIDE 1

1 / 13 Genome Informatics

Building and Documenting Bioinformatics Workflows with Python-based Snakemake

Johannes K¨

  • ster, Sven Rahmann

German Conference on Bioinformatics September 2012

slide-2
SLIDE 2

2 / 13 Genome Informatics

Structure

1 Motivation 2 Snakemake Language 3 Snakemake Engine 4 Conclusion

slide-3
SLIDE 3

3 / 13 Genome Informatics

Motivation

data

new samples proteomics protein networks sequence reads ... tools / scripts

samtools bwa gatk ... adjust parameters document results ... plots tables

slide-4
SLIDE 4

3 / 13 Genome Informatics

Motivation

data

new samples proteomics protein networks sequence reads ... tools / scripts

samtools bwa gatk ... adjust parameters document results ... plots tables

slide-5
SLIDE 5

3 / 13 Genome Informatics

Motivation

data

new samples proteomics protein networks sequence reads ... tools / scripts

samtools bwa gatk ... adjust parameters document results ... plots tables

slide-6
SLIDE 6

3 / 13 Genome Informatics

Motivation

data

new samples proteomics protein networks sequence reads ... tools / scripts

samtools bwa gatk ... adjust parameters document results ... plots tables

slide-7
SLIDE 7

3 / 13 Genome Informatics

Motivation

data

new samples proteomics protein networks sequence reads ... tools / scripts

samtools bwa gatk ... adjust parameters document results ... plots tables

slide-8
SLIDE 8

3 / 13 Genome Informatics

Motivation

data

new samples proteomics protein networks sequence reads ... tools / scripts

samtools bwa gatk ... adjust parameters document results ... plots tables

slide-9
SLIDE 9

4 / 13 Genome Informatics

Workflow Descriptions

IDIR=../include ODIR=obj LDIR=../lib LIBS=-lm CC=gcc CFLAGS=-I$(IDIR) _HEADERS = hello.h HEADERS = $(patsubst %,$(IDIR)/%,$(_HEADERS)) _OBJS = hello.o hellofunc.o OBJS = $(patsubst %,$(ODIR)/%,$(_OBJS)) # build the executable from the object files hello: $(OBJS) $(CC) -o $@ $^ $(CFLAGS) # compile a single .c file to an .o file $(ODIR)/%.o: %.c $(HEADERS) $(CC) -c -o $@ $< $(CFLAGS) # clean up temporary files .PHONY: clean clean: rm -f $(ODIR)/*.o *~ core $(IDIR)/*~

http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor http://www.taverna.org.uk

slide-10
SLIDE 10

5 / 13 Genome Informatics

Why Snakemake?

GNU Make provided us with... a language to write rules to create each output file from input files wildcards for generalization implicit dependency resolution implicit parallelization fast and collaborative development on text files

slide-11
SLIDE 11

5 / 13 Genome Informatics

Why Snakemake?

GNU Make provided us with... a language to write rules to create each output file from input files wildcards for generalization implicit dependency resolution implicit parallelization fast and collaborative development on text files but we missed... easy to read syntax simple scripting inside the workflow creating more than one output file with a rule multiple wildcards in filenames

slide-12
SLIDE 12

6 / 13 Genome Informatics

Snakemake Language

Idea: extend the Python syntax but avoid to write a full parser Snakefile Python tokenizer Token Automaton input: Snakefile tokens emission: Python tokens transition: prefix-free grammar Python Interpreter

slide-13
SLIDE 13

6 / 13 Genome Informatics

Snakemake Language

Idea: extend the Python syntax but avoid to write a full parser Snakefile Python tokenizer Token Automaton input: Snakefile tokens emission: Python tokens transition: prefix-free grammar Python Interpreter rule map_reads: input: "hg19.fasta", "{sample}.fastq"

  • utput: "{sample}.sai"

shell: "bwa aln {input} > {output}"

slide-14
SLIDE 14

6 / 13 Genome Informatics

Snakemake Language

Idea: extend the Python syntax but avoid to write a full parser Snakefile Python tokenizer Token Automaton input: Snakefile tokens emission: Python tokens transition: prefix-free grammar Python Interpreter @rule("map_reads") input: "hg19.fasta", "{sample}.fastq"

  • utput: "{sample}.sai"

shell: "bwa aln {input} > {output}"

slide-15
SLIDE 15

6 / 13 Genome Informatics

Snakemake Language

Idea: extend the Python syntax but avoid to write a full parser Snakefile Python tokenizer Token Automaton input: Snakefile tokens emission: Python tokens transition: prefix-free grammar Python Interpreter @rule("map_reads") @input("hg19.fasta", "{sample}.fastq")

  • utput: "{sample}.sai"

shell: "bwa aln {input} > {output}"

slide-16
SLIDE 16

6 / 13 Genome Informatics

Snakemake Language

Idea: extend the Python syntax but avoid to write a full parser Snakefile Python tokenizer Token Automaton input: Snakefile tokens emission: Python tokens transition: prefix-free grammar Python Interpreter @rule("map_reads") @input("hg19.fasta", "{sample}.fastq") @output("{sample}.sai") shell: "bwa aln {input} > {output}"

slide-17
SLIDE 17

6 / 13 Genome Informatics

Snakemake Language

Idea: extend the Python syntax but avoid to write a full parser Snakefile Python tokenizer Token Automaton input: Snakefile tokens emission: Python tokens transition: prefix-free grammar Python Interpreter @rule("map_reads") @input("hg19.fasta", "{sample}.fastq") @output("{sample}.sai") def __map_reads(input, output, wildcards): shell("bwa aln {input} > {output}")

slide-18
SLIDE 18

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19.

slide-19
SLIDE 19

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. rule map_reads: input: "hg19.fasta", "{sample}.fastq"

  • utput: "{sample}.sai"

shell: "bwa aln {input} > {output}"

slide-20
SLIDE 20

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. rule sai_to_bam: input: "hg19.fasta", "{sample}.sai", "{sample}.fastq"

  • utput: "{sample}.bam"

shell: "bwa samse {input} | samtools view -Sbh - > {output}" rule map_reads: input: "hg19.fasta", "{sample}.fastq"

  • utput: "{sample}.sai"

shell: "bwa aln {input} > {output}"

slide-21
SLIDE 21

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. SAMPLES = "500 501 502 503".split() rule all: input: expand("{sample}.bam", sample=SAMPLES) rule sai_to_bam: input: "hg19.fasta", "{sample}.sai", "{sample}.fastq"

  • utput: "{sample}.bam"

shell: "bwa samse {input} | samtools view -Sbh - > {output}" rule map_reads: input: "hg19.fasta", "{sample}.fastq"

  • utput: "{sample}.sai"

shell: "bwa aln {input} > {output}"

slide-22
SLIDE 22

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. SAMPLES = "500 501 502 503".split() rule all: input: expand("{sample}.bam", sample=SAMPLES) rule sai_to_bam: input: "hg19.fasta", "{sample}.sai", "{sample}.fastq"

  • utput: protected("{sample}.bam")

shell: "bwa samse {input} | samtools view -Sbh - > {output}" rule map_reads: input: "hg19.fasta", "{sample}.fastq"

  • utput: "{sample}.sai"

shell: "bwa aln {input} > {output}"

slide-23
SLIDE 23

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. SAMPLES = "500 501 502 503".split() rule all: input: expand("{sample}.bam", sample=SAMPLES) rule sai_to_bam: input: "hg19.fasta", "{sample}.sai", "{sample}.fastq"

  • utput: protected("{sample}.bam")

shell: "bwa samse {input} | samtools view -Sbh - > {output}" rule map_reads: input: "hg19.fasta", "{sample}.fastq"

  • utput: temp("{sample}.sai")

shell: "bwa aln {input} > {output}"

slide-24
SLIDE 24

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. rule sai_to_bam: input: "hg19.fasta", "{sample}.sai", "{sample}.fastq"

  • utput: protected("{sample}.bam")

shell: "bwa samse {input} | samtools view -Sbh - > {output}" rule map_reads: input: "hg19.fasta", "{sample}.fastq"

  • utput: temp("{sample}.sai")

shell: "bwa aln {input} > {output}" rule all 500.bam, 501.bam, 502.bam, 503.bam rule sai to bam 500.sai rule map reads 500.fastq rule sai to bam 501.sai rule map reads 501.fastq rule sai to bam 502.sai rule map reads 502.fastq rule sai to bam 503.sai rule map reads 503.fastq

slide-25
SLIDE 25

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. rule map_reads: input: "hg19.fasta", "{sample}.fastq"

  • utput: temp("{sample}.sai")

shell: "bwa aln {input} > {output}" rule all 500.bam, 501.bam, 502.bam, 503.bam rule sai to bam 500.sai rule map reads 500.fastq rule sai to bam 501.sai rule map reads 501.fastq rule sai to bam 502.sai rule map reads 502.fastq rule sai to bam 503.sai rule map reads 503.fastq

slide-26
SLIDE 26

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. rule all 500.bam, 501.bam, 502.bam, 503.bam rule sai to bam 500.sai rule map reads 500.fastq rule sai to bam 501.sai rule map reads 501.fastq rule sai to bam 502.sai rule map reads 502.fastq rule sai to bam 503.sai rule map reads 503.fastq

slide-27
SLIDE 27

7 / 13 Genome Informatics

Example Workflow

For samples {500, . . . , 503} map reads to hg19. rule all 500.bam, 501.bam, 502.bam, 503.bam rule sai to bam 500.sai rule map reads 500.fastq rule sai to bam 501.sai rule map reads 501.fastq rule sai to bam 502.sai rule map reads 502.fastq rule sai to bam 503.sai rule map reads 503.fastq

slide-28
SLIDE 28

8 / 13 Genome Informatics

Python Rules

import matplotlib.pyplot as plt rule plot_coverage_histogram: input: "{sample}.bam"

  • utput: hist = "{sample}.coverage.pdf"

run: plt.hist(list(map(int, shell("samtools mpileup {input} | cut -f4", iterable = True)))) plt.savefig(output.hist)

slide-29
SLIDE 29

9 / 13 Genome Informatics

R Rules

import rpy2.robjects as robjects rule plot_coverage_histogram: input: "{sample}.fastq"

  • utput: "{sample}.stats.csv"

run: robjects.r(format(""" # some R code """))

slide-30
SLIDE 30

10 / 13 Genome Informatics

Snakemake Engine

rule all 500.bam, 501.bam, 502.bam, 503.bam rule sai to bam 500.sai rule map reads 500.fastq rule sai to bam 501.sai rule map reads 501.fastq rule sai to bam 502.sai rule map reads 502.fastq rule sai to bam 503.sai rule map reads 503.fastq

slide-31
SLIDE 31

10 / 13 Genome Informatics

Snakemake Engine

rule all 500.bam, 501.bam, 502.bam, 503.bam rule sai to bam 500.sai rule map reads 500.fastq rule sai to bam 501.sai rule map reads 501.fastq rule sai to bam 502.sai rule map reads 502.fastq rule sai to bam 503.sai rule map reads 503.fastq DAG of jobs each path needs to be executed serially two disjoint paths can be executed in parallel

slide-32
SLIDE 32

11 / 13 Genome Informatics

Building the DAG

File matching

"500.bam" matches "{sample}.bam" ⇔ "500.bam" ∈ L(".+\.bam") In case of ambiguity: Constrain wildcards: "{sample,[0-9]+}.bam" Order rules: ruleorder: sai_to_bam > sort_bam

slide-33
SLIDE 33

12 / 13 Genome Informatics

Job Scheduling

Goals: restrict the number of parallel jobs take threads of individual jobs into account

slide-34
SLIDE 34

12 / 13 Genome Informatics

Job Scheduling

Goals: restrict the number of parallel jobs take threads of individual jobs into account

Job Scheduling Problem

let J be the set of jobs ready to execute let tj be the number of threads a job j uses (1 by default) let T be a given threshold of available cores (I of them being idle) then execute the set of jobs E ∗ among all E ⊆ J that maximizes

  • j∈E

min(tj, T) such that the sum remains bounded by I

slide-35
SLIDE 35

13 / 13 Genome Informatics

Conclusion

Snakemake is a new workflow system that provides: an easy pythonic textual representation multiple wildcards in filenames implicit parallelization and dependency resolution job scheduling that takes threads into account cluster support http://bitbucket.org/johanneskoester/snakemake depends on Python ≥ 3.2