CnC as workflow coordination language for scientific computing

Parallel Recipes
Yves Vandriessche
Sept. 08, 2015
scripts deal with the complexity of gluing applications together
GATK, BWA, Picard, TopHat, samtools, …
Broad Institute best practices seq. pipeline ~ 200 SLoC
Distribution and parallelisation explode the accidental complexity of scripts
GATK, BWA, Picard, TopHat, samtools, …
distributed seq. pipeline ~ 2000 SLoC
eHive exome pipeline
28,066 SLoC¹ (Perl) => $898,255 est.
¹ generated using David A. Wheeler's 'SLOCCount'
parallel recipe:
What is the essential Δ between a sequential and a parallel script?

sources of ordering:
- in a sequential world: iteration, branching, recursion, …
- in a parallel world: produce/consume, consistency; concurrency (shared resources)

fewer orderings => more parallelism => more performance
complex glue × complex coordination

CnC offers:
- reuse scripting as glue
- Intel Concurrent Collections inside as coordination
parallel hello world recipe:

[dependency graph: A -> A_done -> finish; B -> B_finished -> Bbis -> B_or_C_done -> finish; C -> B_or_C_done]

B
  command: $ echo 'B'

Bbis
  command: $ echo 'another thing for B'
  in: B_finished
  out: B_or_C_done

finish
  command: $ echo 'finished'
  in: { A_done, B_or_C_done }

each stage declares:
  command: what needs to happen when I start?
  in: what dependencies need to be satisfied before I can start?
  out: what dependencies are satisfied after I finished successfully?
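The command/in/out contract can be sketched as a tiny dependency-driven scheduler. This is an illustrative model in Python, not the precipes/CnC implementation; run_recipe is a hypothetical helper and commands are elided:

```python
# Minimal sketch of dependency-driven stage execution (illustrative,
# not the actual precipes/CnC engine).
def run_recipe(stages):
    """stages: name -> {"in": deps needed, "out": deps produced}.
    Runs every stage whose 'in' deps are satisfied; returns run order."""
    satisfied, order = set(), []
    pending = dict(stages)
    while pending:
        # a stage is ready once all of its 'in' dependencies are satisfied
        ready = [n for n, s in pending.items()
                 if set(s.get("in", ())) <= satisfied]
        if not ready:
            raise RuntimeError("deadlock: unsatisfiable dependencies")
        for name in sorted(ready):      # deterministic order for the example
            order.append(name)          # the stage's "command" would run here
            satisfied |= set(pending.pop(name).get("out", ()))
    return order

# the parallel hello world recipe from the slide
recipe = {
    "A":      {"out": {"A_done"}},
    "B":      {"out": {"B_finished"}},
    "Bbis":   {"in": {"B_finished"}, "out": {"B_or_C_done"}},
    "C":      {"out": {"B_or_C_done"}},
    "finish": {"in": {"A_done", "B_or_C_done"}},
}
print(run_recipe(recipe))  # → ['A', 'B', 'C', 'Bbis', 'finish']
```

In the first round A, B, and C are ready (they need nothing); only once B_finished and then B_or_C_done exist can Bbis and finish run. A real engine runs all ready stages in parallel instead of in a list.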
parallel hello world recipe bis:

practical consideration: parallel scripts rarely run only once

[pipeline: fetch dosier -> dosier -> extract gross income -> income -> report income]

fetch dosier
  command: $ wget ftp://citizenfiles.gov/dosiers/yves.txt .

extract gross income
  command: $ grep 'gross' yves.txt > yves_gross.txt

report income
  command: $ echo -n citizen yves is making; cat yves_gross.txt ; echo a year.
parallel hello world recipe bis:

practical consideration: parallel scripts rarely run only once; parallel scripts typically run data-parallel

[same pipeline, parameterised by a tag that fills in the {} placeholders]

fetch dosier
  command: $ wget ftp://citizenfiles.gov/dosiers/{}.txt .

extract gross income
  command: $ grep 'gross' {}.txt > {}_gross.txt

report income
  command: $ echo -n citizen {} is making ; cat {}_gross.txt ; echo a year.

tags: yves, tom, roel
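The {} placeholder turns one recipe into a family of command lines, one per tag. A minimal sketch of that substitution (illustrative; Python stands in for whatever the engine does internally):

```python
# Sketch of tag substitution: every "{}" in a stage command is replaced
# by the current tag (illustrative model of the parameterised recipe).
def instantiate(command_template, tag):
    return command_template.replace("{}", tag)

commands = [
    "wget ftp://citizenfiles.gov/dosiers/{}.txt .",
    "grep 'gross' {}.txt > {}_gross.txt",
]
for tag in ["yves", "tom", "roel"]:
    for cmd in commands:
        print(instantiate(cmd, tag))
```

Each tag yields its own fully concrete command lines, so the same recipe describes one run per citizen.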
parallel hello world recipe bis:

[the whole fetch -> extract -> report pipeline is instantiated once per tag: yves, tom, roel, …]
{
  "stages" : {
    "A"      : { "command" : "echo A for {}.",                    "out" : "A_done" },
    "B"      : { "command" : "echo B for {}.",                    "out" : "B_finished" },
    "Bbis"   : { "command" : "echo One more thing for B and {}.", "in" : "B_finished", "out" : "B_or_C_done" },
    "C"      : { "command" : "echo C for {}.",                    "out" : "B_or_C_done" },
    "finish" : { "command" : "echo Done with A and B for {}.",    "in" : ["A_done", "B_or_C_done"] }
  }
}
JSON parallel recipe:

[dependency graph: A -> A_done; B -> B_finished -> Bbis -> B_or_C_done; C -> B_or_C_done; A_done + B_or_C_done -> finish]
$ ./precipes -p bpp.dot exome_best_practices_pipeline.json
{ "stages" : {
    "check_paired" : {
        "command" : "$CHECK_EXISTS $READS/{}_1.filt.fastq.gz",
        "out" : "has_paired_end_reads" },
    "fetch_unpaired" : {
        "command" : "$FETCH $READS/{}.filt.fastq.gz $LOCAL_DIR/{}.unpaired.fastq.gz",
        "out" : "unpaired.fastq.gz" },
    "fetch_paired_1" : {
        "command" : "$FETCH $READS/{}_1.filt.fastq.gz $LOCAL_DIR/{}.paired_1.fastq.gz",
        "in" : "has_paired_end_reads",
        "out" : "paired_1.fastq.gz" },
    "fetch_paired_2" : {
        "command" : "$FETCH $READS/{}_2.filt.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz",
        "in" : "has_paired_end_reads",
        "out" : "paired_2.fastq.gz" },
    "alignment_paired" : {
        "command" : "$BWA mem -R '@RG\\tID:Group1\\tLB:lib1\\tPL:illumina\\tSM:sample1' \
            $LOCAL_DIR/{}.paired_1.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz \
            > $LOCAL_DIR/{}.paired.sam && rm $LOCAL_DIR/{}.paired_1.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz",
        "in" : ["paired_1.fastq.gz", "paired_2.fastq.gz"],
        "out" : "paired.sam" },
    …
[exome best-practices pipeline graph: check/fetch paired and unpaired reads -> BWA alignment -> sort by coordinate -> merge bams -> remove duplicates -> indel realignment -> base recalibration -> call variants -> vcf] [1]

[1] G. A. Van der Auwera et al., "From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Curr. Protoc. Bioinform. 11.10.1-11.10.33, October 2013.
Execution

bash$ ./precipes exome_best_practices_pipeline.json sample_{00..07}

[.json recipe -> ./precipes core]
[the ./precipes core translates each .json stage into an add_stage call:]

add_stage( "fetch_paired_1", "$FETCH $READS/…", { "has_paired_end_reads" }, { "paired_1.fastq.gz" } );
add_stage( "check_paired", "test -f …", { }, { "has_paired_end_reads" } );
add_stage( … );
// start running samples in parallel
> for( int i = 2; i < argc; ++i )
>     pipeline.run( argv[i], i-2 );
sample_00 … sample_07
> pipeline.tags.put( "sample_00" )
> pipeline.tags.put( "sample_01" )
> …
Execution
bash$ ./precipes exome_best_practices_pipeline.json sample_{00..07}
pipeline.wait()
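A sketch of what these tag puts amount to: one pipeline instance per tag, executed by a pool of workers, with wait() blocking until all instances finish. This Pipeline class is a hypothetical Python stand-in for the CnC-backed engine, not the precipes API:

```python
# Illustrative model of tag-driven data parallelism: each tags.put(tag)
# launches one pipeline instance on a worker pool; wait() joins them all.
from concurrent.futures import ThreadPoolExecutor

class Pipeline:
    def __init__(self, instance, workers=2):
        self.instance = instance                  # callable run once per tag
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.futures = []

    def put(self, tag):                           # ~ pipeline.tags.put(tag)
        self.futures.append(self.pool.submit(self.instance, tag))

    def wait(self):                               # ~ pipeline.wait()
        return [f.result() for f in self.futures]

pipeline = Pipeline(lambda tag: f"processed {tag}")
for i in range(8):
    pipeline.put(f"sample_{i:02d}")
print(pipeline.wait())
```

Because instances share no state, adding workers (or nodes) scales the tag set out directly, which is what the scaling experiments below measure.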
Exome Best Practices Scaling Experiment: Runtime
parallel scaling experiment: 32 samples from g1k NA12878

# compute nodes | 1 worker thread | 2 worker threads
1               | 158h 7m         | 80h 31m
2               | 79h 21m         | 41h 21m
4               | 40h 21m         | 21h 38m
8               | 21h 7m          | 12h 20m
Scaling Efficiency: single fat node (exome best practices, 32 samples)

# workers | runtime  | efficiency
1         | 336h 19m | 100%
2         | 171h 12m | 98.224%
4         | 87h 15m  | 96.369%
8         | 44h 7m   | 95.285%
16        | 25h 13m  | 83.361%
24        | 19h 22m  | 72.38%
32        | 15h 12m  | 69.132%
64        | 14h 55m  | 46.928%
Scaling Efficiency: 1 worker

# compute nodes | runtime  | efficiency
1               | 158h 7m  | 100.00%
2               | 79h 21m  | 99.63%
4               | 40h 21m  | 97.97%
8               | 21h 7m   | 93.60%

Scaling Efficiency: 2 workers

# compute nodes | runtime  | efficiency
1               | 80h 31m  | 100.00%
2               | 41h 21m  | 97.36%
4               | 21h 38m  | 93.05%
8               | 12h 20m  | 81.60%
Scaling Efficiency: cluster

[execution trace: 32 samples, 4 nodes, 2 workers]
Common Workflow Language¹ (CWL) integration
{ …
  "run": {
    "inputs": [
      { "inputBinding": { "position": 1, "prefix": "--reverse" },
        "type": "boolean",
        "id": "#reverse" },
      { "inputBinding": { "position": 2 },
        "type": "File",
        "id": "#input" }
    ],
    …
    "class": "Workflow" }
¹ https://github.com/common-workflow-language/common-workflow-language
Shoutout to BOSC CodeFest 2015!
[split/join exome pipeline graph: paired and unpaired fastq inputs are split into chunks, aligned and sorted per chunk, then joined and merged back into aligned bams]
advanced workflow coordination
- easy thanks to CnC coordination
- front-end language bottleneck
Execution — SplitJoin support
- easy thanks to CnC coordination
- front-end language bottleneck
… "splitjoin" : {
    "split" : {
        "command" : "zcat -v $LOCAL_DIR/{}.unpaired.fastq.gz \
            | split -d -l 40000000 \
            …
            gzip --best -c - > $FILE.unpaired.fastq.gz' \
            …",
        "in" : ["first", "second"],
        "fanout" : "chunks_in" },
    "count" : "ls $LOCAL_DIR/{}.*.unpaired.fastq.gz | wc -l",
    "stages" : {
        "process_chunk" : {
            "command" : "echo processing chunk {}_##",
            "in" : "chunks_in",
            "out" : "chunks_out" },
        "join" : {
            "command" : "$SAMTOOLS merge -f $LOCAL_DIR/{}.unpaired.bam \
                @($LOCAL_DIR/{}.chunk_##.sorted.unpaired.bam)",
            "fanin" : "chunks_out",
            "out" : "splitjoin_finished" } }, …
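The splitjoin semantics — split produces a runtime-determined number of chunks, the count command discovers how many, each chunk is processed independently, and join fans the results back in — can be sketched as follows (illustrative Python; list slices stand in for fastq chunk files):

```python
# Sketch of split / process-per-chunk / join with a runtime-determined
# chunk count (illustrative model of the "splitjoin" recipe block).
def split(data, chunk_size):
    # the "split" stage: cut the input into fixed-size chunks
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def splitjoin(data, chunk_size, process):
    chunks = split(data, chunk_size)
    count = len(chunks)                          # the "count" command
    processed = [process(c) for c in chunks]     # "process_chunk", data-parallel
    joined = [x for c in processed for x in c]   # "join" / fanin stage
    return count, joined

count, joined = splitjoin(list(range(10)), 4, lambda c: [x * 2 for x in c])
print(count, joined)  # → 3 [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The key point is that count is only known after split runs, so the join's dependency set cannot be written down statically in the recipe — which is exactly the expression problem the next slides discuss.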
Execution — advanced coordination expression problem

get( in_deps, N/A )
for dep in out_deps: put( dep, N/A )
fanout with count: for i in count: put( dep, i )
Coordination Languages and their Significance

"We can build a complete programming model"

Execution — advanced coordination expression problem
bash$ put(chunk_count, `ls *.foo | wc -l`)
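The idea of a client shell putting an item straight into the coordination space can be sketched as a put whose value comes from a subprocess. Here put_from_shell and the dict-backed space are hypothetical helpers; no such shell bridge exists in precipes as presented:

```python
# Sketch of a client/peer "put" whose value comes from a shell command
# (hypothetical bridge; illustrates the bash$ put(chunk_count, `...`) ideal).
import subprocess

space = {}  # stand-in for the CnC item collection

def put_from_shell(space, key, command):
    # run the shell command and put its stdout into the coordination space
    out = subprocess.run(command, shell=True, capture_output=True, text=True)
    space[key] = out.stdout.strip()

# stands in for: put(chunk_count, `ls *.foo | wc -l`)
put_from_shell(space, "chunk_count", "echo 3")
print(space["chunk_count"])  # → 3
```

Any stage whose in-set mentions chunk_count would then unblock once the client's put arrives, without a special-purpose construct per coordination pattern.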
Execution — advanced workflow coordination
- easy thanks to CnC coordination
- front-end language bottleneck
- not just split/join: recursion, groupby, reduce, streaming, …
- common: special construct for each
- ideal: use CnC to coordinate with client/peer applications
    bash$ put(chunk_count, `ls *.foo | wc -l`)
- client/peer coordination bottleneck
Mesos¹ integration
- resource management
- deployment
- towards resource-aware scheduling

¹ http://mesos.apache.org/
execution engine!
acceptance
https://github.com/yvdriess/precipes