
Parallel Recipes. Yves Vandriessche, Sept. 08, 2015



  1. CnC as workflow coordination language for scientific computing. Parallel Recipes. Yves Vandriessche, Sept. 08, 2015.

  2. Scripts deal with the complexity of gluing together applications.
     script + applications (GATK, BWA, Picard, TopHat, samtools, …) = pipeline
     Example: Broad Institute best practices seq. pipeline, ~200 SLoC.

  3. Distribution and parallelisation explode the accidental complexity of scripts.
     (script + GATK, BWA, Picard, TopHat, samtools, …) x distribution = distributed seq. pipeline, ~2000 SLoC
     Example: eHive exome pipeline, 28,066 SLoC¹ (Perl) => $898,255 est.
     ¹ generated using David A. Wheeler's 'SLOCCount'

  4. Parallel recipe: what is the essential difference between a sequential and a parallel script? Ordering dependencies!
     In a sequential world: one single ordering of operations.
     In a parallel world: more orderings => more parallelism => more performance.
     Sources of ordering:
     • data dependencies: produce/consume, consistency
     • control dependencies: iteration, branching, recursion, …
     • concurrency (shared resources)
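To make the "more orderings => more parallelism" point concrete, the sketch below counts the valid execution orders of two small, hypothetical four-step scripts. The step names and the `count_orderings` helper are illustrative assumptions, not part of the talk or of precipes.

```python
from itertools import permutations

def count_orderings(steps, deps):
    """Count the orders of `steps` that respect every (before, after) pair in `deps`."""
    count = 0
    for order in permutations(steps):
        position = {step: i for i, step in enumerate(order)}
        if all(position[a] < position[b] for a, b in deps):
            count += 1
    return count

steps = ["fetch", "align", "sort", "report"]

# Sequential script: every step implicitly depends on the previous one.
chain = [("fetch", "align"), ("align", "sort"), ("sort", "report")]

# Only the real dependencies: align and sort both need fetch, report needs both.
diamond = [("fetch", "align"), ("fetch", "sort"),
           ("align", "report"), ("sort", "report")]

print(count_orderings(steps, chain))    # 1
print(count_orderings(steps, diamond))  # 2
```

The chain admits exactly one order; dropping the artificial dependency between `align` and `sort` leaves two valid orders, which is what allows those two steps to run at the same time.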

  5. Parallel Recipes: precipes
     complex glue x complex coordination
     • glue: reuse scripting
     • coordination: ordering dependencies
     Intel Concurrent Collections (CnC) inside, as coordination language. CnC offers:
     • cluster-level and node-level parallelism
     • determinate execution
     • flexible parallel execution model
     • stable & practical implementation (CnC++)

  6. Parallel hello world recipe:
     Stage B:
       command: $ echo 'B'
       out: B_done
     Stage Bbis:
       command: $ echo 'another thing for B'
       in: B_done
       out: B_or_C_done
     Stage finish:
       command: $ echo 'finished'
       in: { A_done, B_or_C_done }
     Graph: stages A, B, Bbis, C, finish; items B_finished, A_done, B_or_C_done.
     Each stage answers three questions:
     • command: what needs to happen when I start?
     • in: what dependencies need to be satisfied before I can start?
     • out: what dependencies are satisfied after I finished successfully?
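The command/in/out model can be sketched as a small loop that starts a stage once all of its `in` items exist. This is a minimal sequential illustration, not the CnC-based precipes engine; the item names follow the slide's graph, the commands for A and C are placeholders, and treating `B_or_C_done` as satisfied by whichever producer finishes first is an assumption.

```python
# Stage table: each stage has a command, the items it waits for, and the items it produces.
stages = {
    "A":      {"command": "echo A", "in": [], "out": ["A_done"]},
    "B":      {"command": "echo B", "in": [], "out": ["B_finished"]},
    "Bbis":   {"command": "echo another thing for B",
               "in": ["B_finished"], "out": ["B_or_C_done"]},
    "C":      {"command": "echo C", "in": [], "out": ["B_or_C_done"]},
    "finish": {"command": "echo finished",
               "in": ["A_done", "B_or_C_done"], "out": []},
}

def run_recipe(stages):
    """Return one valid execution order; a stage is ready once all its `in` items exist."""
    available = set()        # dependency items produced so far
    pending = dict(stages)   # stages that have not run yet
    order = []
    while pending:
        ready = [name for name, s in pending.items() if set(s["in"]) <= available]
        if not ready:
            raise RuntimeError(f"unsatisfiable dependencies: {sorted(pending)}")
        for name in ready:
            order.append(name)   # a real engine would launch the command here
            available.update(pending.pop(name)["out"])
    return order

order = run_recipe(stages)
print(order)
```

All stages collected in `ready` during one pass are independent of each other, so a parallel engine could launch them together.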

  7. Parallel hello world recipe bis. Practical consideration: parallel scripts rarely run only once.
     Chain: fetch dossier -> dossier -> extract gross income -> income -> report income
     • fetch dossier, command: $ wget ftp://citizenfiles.gov/dosiers/yves.txt .
     • extract gross income, command: $ grep 'gross' yves.txt > yves_gross.txt
     • report income, command: $ echo -n citizen yves is making; cat yves_gross.txt ; echo a year.

  8. Parallel hello world recipe bis. Practical considerations: parallel scripts rarely run only once; parallel scripts typically run data-parallel.
     Samples: tom, roel, yves. The sample name replaces {} in each command.
     Chain: fetch dossier -> dossier -> extract gross income -> income -> report income
     • fetch dossier, command: $ wget ftp://citizenfiles.gov/dosiers/{}.txt .
     • extract gross income, command: $ grep 'gross' {}.txt > {}_gross.txt
     • report income, command: $ echo -n citizen {} is making ; cat {}_gross.txt ; echo a year.

  9. Parallel hello world recipe bis. Out of the box: data-parallel runs.
     For each sample (yves, tom, roel, …) an independent chain runs:
     fetch dossier -> dossier -> extract gross income -> income -> report income
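The `{}` placeholder turns one recipe into one command chain per sample. A minimal sketch of that expansion, assuming plain textual substitution (how precipes itself substitutes is not shown on the slides); the dictionary keys and the `instantiate` helper are hypothetical names.

```python
# Command templates from the slide; "{}" stands for the sample name.
templates = {
    "fetch_dossier": "wget ftp://citizenfiles.gov/dosiers/{}.txt .",
    "extract_gross_income": "grep 'gross' {}.txt > {}_gross.txt",
    "report_income": "echo -n citizen {} is making ; cat {}_gross.txt ; echo a year.",
}

def instantiate(template, sample):
    # Plain replacement rather than str.format, so shell braces such as ${VAR} stay untouched.
    return template.replace("{}", sample)

for sample in ["tom", "roel", "yves"]:
    for stage, template in templates.items():
        print(f"[{sample}] {stage}: {instantiate(template, sample)}")
```

Because each sample only touches its own files, the three chains share no items and can run side by side.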

  10. JSON recipe for the parallel hello world (graph: A, B, Bbis, C, finish; items B_finished, A_done, B_or_C_done):

      {
        "stages" : {
          "A" : {
            "command" : "echo A for {}.",
            "out" : "A_done"
          },
          "B" : {
            "command" : "echo B for {}.",
            "out" : "B_finished"
          },
          "Bbis" : {
            "command" : "echo One more thing for B and {}.",
            "in" : "B_finished",
            "out" : "B_or_C_done"
          },
          "C" : {
            "command" : "echo C for {}.",
            "out" : "B_or_C_done"
          },
          "finish" : {
            "command" : "echo Done with A and B for {}.",
            "in" : ["A_done", "B_or_C_done"]
          }
        }
      }
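A recipe in this shape can be turned into a dependency graph with a few lines of code. The sketch below parses the hello world recipe and lists its edges; it assumes that `in` and `out` may each be a single string or a list, as on the slide, and the `as_list` and `edges` helpers are hypothetical names, not part of precipes.

```python
import json

RECIPE = """
{ "stages": {
    "A":      { "command": "echo A for {}.", "out": "A_done" },
    "B":      { "command": "echo B for {}.", "out": "B_finished" },
    "Bbis":   { "command": "echo One more thing for B and {}.",
                "in": "B_finished", "out": "B_or_C_done" },
    "C":      { "command": "echo C for {}.", "out": "B_or_C_done" },
    "finish": { "command": "echo Done with A and B for {}.",
                "in": ["A_done", "B_or_C_done"] }
} }
"""

def as_list(value):
    """Normalise a missing value, a single string, or a list into a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)

def edges(recipe):
    """Yield (source, target) pairs: item -> consuming stage, stage -> produced item."""
    for name, stage in recipe["stages"].items():
        for item in as_list(stage.get("in")):
            yield (item, name)
        for item in as_list(stage.get("out")):
            yield (name, item)

graph = list(edges(json.loads(RECIPE)))
for source, target in graph:
    print(f"{source} -> {target}")
```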

  11. JSON parallel recipe: exome best practices pipeline [1]

      $ ./precipes -p bpp.dot exome_best_practices_pipeline.json

      Pipeline graph (stages and items, top to bottom): check_paired, has_paired_end_reads, fetch_paired_1, fetch_paired_2, fetch_unpaired, paired_1.fastq.gz, paired_2.fastq.gz, unpaired.fastq.gz, alignment_paired, alignment_unpaired, paired.sam, unpaired.sam, check_no_unpaired, sort_for_coordinate_order_paired, sort_for_coordinate_order_unpaired, check_no_paired, no_unpaired_end_reads, sorted_paired.bam, sorted_unpaired.bam, no_paired_end_reads, merge_bams_paired, merge_bams_paired_unpaired, merge_bams_unpaired, sorted.bam, remove_duplicates, dedup.bam, build_bam_index_1, dedup.bai, realign_around_indels_1, intervals, realign_around_indels_2, 7.bam, build_bam_index_2, 7.bai, base_recalibrate_1, recal, base_recalibrate_2, 8.bam, 8.bai, call_variants, vcf, vcfinocx.

      {
        "stages" : {
          "check_paired" : {
            "command" : "$CHECK_EXISTS $READS/{}_1.filt.fastq.gz",
            "out" : "has_paired_end_reads"
          },
          "fetch_unpaired" : {
            "command" : "$FETCH $READS/{}.filt.fastq.gz $LOCAL_DIR/{}.unpaired.fastq.gz",
            "out" : "unpaired.fastq.gz"
          },
          "fetch_paired_1" : {
            "command" : "$FETCH $READS/{}_1.filt.fastq.gz $LOCAL_DIR/{}.paired_1.fastq.gz",
            "in" : "has_paired_end_reads",
            "out" : "paired_1.fastq.gz"
          },
          "fetch_paired_2" : {
            "command" : "$FETCH $READS/{}_2.filt.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz",
            "in" : "has_paired_end_reads",
            "out" : "paired_2.fastq.gz"
          },
          "alignment_paired" : {
            "command" : "\
              $BWA mem -R '@RG\\tID:Group1\\tLB:lib1\\tPL:illumina\\tSM:sample1' \
              -t $NUM_THREADS $REF/ucsc.hg19.fasta \
              $LOCAL_DIR/{}.paired_1.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz \
              > $LOCAL_DIR/{}.paired.sam &&
              rm $LOCAL_DIR/{}.paired_1.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz",
            "in" : ["paired_1.fastq.gz", "paired_2.fastq.gz"],
            "out" : "paired.sam"
          },
          …

      [1] G. A. Van der Auwera, M. O. Carneiro, C. Hartl, et al., "From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline," Curr. Protoc. Bioinform. 11.10.1-11.10.33, October 2013.

  12. Execution:

      bash$ ./precipes exome_best_practices_pipeline.json sample_{00..07}

      The ./precipes core takes the .json recipe and runs it on:
      • workstation
      • cluster
      • Amazon EC2

  13. Execution:

      bash$ ./precipes exome_best_practices_pipeline.json sample_{00..07}

      The .json recipe becomes add_stage calls on the ./precipes core:

      add_stage( "fetch_paired_1", "$FETCH $READS/…", { "has_paired_end_reads" }, { "paired_1.fastq.gz" } );
      add_stage( "check_paired", "test -f …", { }, { "has_paired_end_reads" } );
      add_stage( … );
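The `add_stage( name, command, in, out )` calls amount to filling a stage registry. The following is a rough Python analogue of such a registry, not the CnC++ implementation; the `Stage` and `Pipeline` classes and the `unproduced_inputs` check are assumptions added for illustration, and the elided commands are kept elided as on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    command: str
    inputs: frozenset
    outputs: frozenset

class Pipeline:
    def __init__(self):
        self.stages = {}

    def add_stage(self, name, command, inputs=(), outputs=()):
        """Register a stage; mirrors the argument order shown on the slide."""
        if name in self.stages:
            raise ValueError(f"duplicate stage: {name}")
        self.stages[name] = Stage(name, command, frozenset(inputs), frozenset(outputs))

    def unproduced_inputs(self):
        """Items that some stage waits for but no registered stage produces."""
        produced, consumed = set(), set()
        for stage in self.stages.values():
            produced |= stage.outputs
            consumed |= stage.inputs
        return consumed - produced

pipeline = Pipeline()
pipeline.add_stage("fetch_paired_1", "$FETCH $READS/…",
                   {"has_paired_end_reads"}, {"paired_1.fastq.gz"})
print(pipeline.unproduced_inputs())   # has_paired_end_reads has no producer yet
pipeline.add_stage("check_paired", "test -f …", set(), {"has_paired_end_reads"})
print(pipeline.unproduced_inputs())   # now empty
```

A check like `unproduced_inputs` is one way to catch a misspelled item name before any command runs.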

  14. Execution:

      bash$ ./precipes exome_best_practices_pipeline.json sample_{00..07}

      // start running samples in parallel
      for( int i = 2; i < argc; ++i )
          pipeline.run( argv[i], i-2 );
            > pipeline.tags.put( "sample_00" ) …
            > pipeline.tags.put( "sample_01" ) …
      pipeline.wait()

      Illustration: items (sai, sam, 1.bam) being produced for sample_00 through sample_07.
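The start-all-then-wait pattern on this slide can be imitated with a thread pool. This is only an analogy under stated assumptions: CnC schedules individual stage instances from tags, whereas the sketch below submits one task per sample, and `run_sample` is a hypothetical stand-in for executing the recipe.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def run_sample(sample, index):
    # Placeholder: a real runner would execute every stage of the recipe for this sample.
    return f"{index}:{sample}"

def run_all(samples, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Start every sample without waiting, like the pipeline.run loop on the slide.
        futures = [pool.submit(run_sample, s, i) for i, s in enumerate(samples)]
        wait(futures)   # analogous to pipeline.wait()
        return [f.result() for f in futures]

# Same sample names that the shell expands from sample_{00..07}.
samples = [f"sample_{i:02d}" for i in range(8)]
results = run_all(samples)
print(results)
```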

  15. Parallel scaling experiment: Exome Best Practices, 32 samples from g1k NA12878. Runtime by number of compute nodes:

      # compute nodes   1 worker thread   2 worker threads
      1                 158h 7m           80h 31m
      2                 79h 21m           41h 21m
      4                 40h 21m           21h 38m
      8                 21h 7m            12h 20m

  16. Scaling efficiency: single fat node (exome best practices, 32 samples)

      # workers   Runtime    Efficiency
      1           336h 19m   100%
      2           171h 12m   98.224%
      4           87h 15m    96.369%
      8           44h 7m     95.285%
      16          25h 13m    83.361%
      24          19h 22m    72.38%
      32          15h 12m    69.132%
      64          14h 55m    46.928%

  17. Scaling efficiency: cluster

      Scaling efficiency: 1 worker
      # compute nodes   Runtime    Efficiency
      1                 158h 7m    100.00%
      2                 79h 21m    99.63%
      4                 40h 21m    97.97%
      8                 21h 7m     93.60%

      Scaling efficiency: 2 workers
      # compute nodes   Runtime    Efficiency
      1                 80h 31m    100.00%
      2                 41h 21m    97.36%
      4                 21h 38m    93.05%
      8                 12h 20m    81.60%
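The efficiency percentages on this slide are consistent with the usual strong-scaling definition, efficiency = T(1 node) / (n x T(n nodes)). A minimal sketch, assuming that definition and assuming the pairing of runtimes to node counts used below (the transcript does not pair the chart labels explicitly); `hours` and `efficiency` are illustrative helper names.

```python
def hours(h, m):
    """Convert an 'XXh YYm' runtime into fractional hours."""
    return h + m / 60

def efficiency(baseline, runtime, scale):
    """Strong-scaling efficiency: ideal runtime (baseline / scale) over measured runtime."""
    return baseline / (scale * runtime)

# Runtimes per number of compute nodes, one worker thread per node.
one_worker = {1: hours(158, 7), 2: hours(79, 21), 4: hours(40, 21), 8: hours(21, 7)}
# Runtimes per number of compute nodes, two worker threads per node.
two_workers = {1: hours(80, 31), 2: hours(41, 21), 4: hours(21, 38), 8: hours(12, 20)}

for label, runs in [("1 worker", one_worker), ("2 workers", two_workers)]:
    for nodes, runtime in runs.items():
        percent = 100 * efficiency(runs[1], runtime, nodes)
        print(f"{label}, {nodes} nodes: {percent:.2f}%")
```

Each series is measured against its own single-node run, which is why both start at 100%.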

  18. Execution trace: 32 samples, 4 nodes, 2 workers. (Figure: trace rows labelled 0, 1, 2, 3.)

  19. Next! Common Workflow Language¹ (CWL) integration

      The ./precipes core targets:
      • workstation
      • cluster
      • Amazon EC2

      CWL excerpt:

      {
        …
        "run": {
          "inputs": [
            {
              "inputBinding": { "position": 1, "prefix": "--reverse" },
              "type": "boolean",
              "id": "#reverse"
            },
            {
              "inputBinding": { "position": 2 },
              "type": "File",
              "id": "#input"
            }
          ],
          …
          "class": "Workflow"
      }

      Shoutout to BOSC CodeFest2015!
      ¹ https://github.com/common-workflow-language/common-workflow-language
