BioMake
Chris Mungall, Berkeley Drosophila Genome Project
cjm@fruitfly.org

Build networks
- Many bioinformaticians spend large amounts of time coding and running build/make networks
- A build network is a recipe describing the execution of a collection of interdependent heterogeneous tasks
  - sequence analysis pipelines
  - data 'compilation'
  - importing, transforming and exporting data
  - LIMS
- Error prone, tedious repetitive code and hard to configure

Existing approaches
- Run tasks by hand, or with ad-hoc scripts
  - doesn't scale, leads to insanity
- Unix/GNU makefiles
  - ++: concise, generic, high level abstraction
  - --: limited expressive power, hacky
- makefile replacements (cons, scons, ant, build…)
  - geared towards software development
- Bio compute pipeline software (biopipe, enspipe)
  - excellent for certain tasks
  - not completely generic

biomake: executable computational protocols
- A declarative language for specifying build networks
  - concise, Turing-complete, highly configurable
- Dependency management
  - e.g. mask genomic sequence prior to genefinding
- Local and remote job execution
  - compute farm job management
- Filesystem or database oriented

Example Genomic Sequence Analysis Pipeline
- Prepare and cut assembled sequence into slices
- Download latest NR peptide dataset and index it
- Blast genomic slices against NR and other datasets
- RepeatMask genomic slices and run genefinders on masked sequence
- Synthesise gene models and do peptide analysis on results
- Store everything in a relational db, and prepare files for export to public

Target dependencies
[Figure: dependency graph of the pipeline targets. Node labels: Assembled Genomic Sequence, FastaDB (remote), FastaDB (local), BlastIndexed FastaDB, Chunked Sequence, Blast Alignments, BlastP & HMM Alignments, RepeatMasked Sequence, Gene Predictions, Gene Models, Relational Database, XML Import, XML Export, Flatfile Export.]

Specifying Targets
- A generic target pattern has a name and arguments
- A target pattern has tags for specifying dependencies, actions and filesystem or database IDs

    formatdb(D)
      run: formatdb -i D

    blast(P,S,D,A)
      req: formatdb(D)
      run: blastall -p P -i S -d D A
      flat: S-blast/bn(S).D.P.out

[Figure: fragment of the dependency graph. FastaDB (local) -> BlastIndexed FastaDB; BlastIndexed FastaDB and Chunked Sequence -> Blast Alignments.]
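
The two patterns above can be modelled in a few lines of Python. This is only a sketch of the idea (name, arguments, and req/run/flat tags filled in by substitution), not biomake's implementation. The `Pattern` class, the `{P}`-style placeholders, the `bn_S` key standing for `bn(S)`, and the reading of a `-` argument as "no extra options" are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass
class Pattern:
    """Hypothetical stand-in for a biomake target pattern."""
    name: str
    params: tuple      # formal arguments, e.g. ("P", "S", "D", "A")
    run: str           # command template for the run: tag
    flat: str = ""     # datastore ID template for the flat: tag
    req: tuple = ()    # required targets, written as text (not resolved here)

    def bind(self, *args):
        """Pair formal parameters with concrete arguments ("-" is read as empty)."""
        if len(args) != len(self.params):
            raise ValueError(f"{self.name} expects {len(self.params)} arguments")
        return {p: ("" if a == "-" else a) for p, a in zip(self.params, args)}

    def command(self, *args):
        """Instantiate the run: tag."""
        return self.run.format(**self.bind(*args)).strip()

    def flat_id(self, *args):
        """Instantiate the flat: tag; bn_X holds the basename of argument X."""
        env = self.bind(*args)
        env.update({"bn_" + k: os.path.basename(v) for k, v in list(env.items())})
        return self.flat.format(**env)


formatdb = Pattern("formatdb", ("D",), run="formatdb -i {D}")
blast = Pattern(
    "blast",
    ("P", "S", "D", "A"),
    run="blastall -p {P} -i {S} -d {D} {A}",
    flat="{S}-blast/{bn_S}.{D}.{P}.out",
    req=("formatdb(D)",),
)

print(blast.command("blastx", "my.seq", "nr.fly", "-"))
# blastall -p blastx -i my.seq -d nr.fly
print(blast.flat_id("blastx", "my.seq", "nr.fly", "-"))
# my.seq-blast/my.seq.nr.fly.blastx.out
```

The flat ID produced here matches the output path shown on the execution slide, which suggests that the spaces in the original `flat:` line are layout residue rather than part of the syntax.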

BioMake Execution

Target patterns:

    blast(P,S,D,A)
      req: formatdb(D)
      run: blastall -p P -i S -d D A
      flat: S-blast/bn(S).D.P.out

    formatdb(D)
      run: formatdb -i D

Execution trace:

    TARGET: blast(blastx, my.seq, nr.fly, -)
      SUBTARGET: formatdb(nr.fly)
        RUN: formatdb -i nr -p T
        OUT: nr.fly.{psq,pin,phr}
             formatdb-nr.fly.OK
      RUN: blastall -p blastx -i my.seq -d nr.fly
      OUT: my.seq-blast/my.seq.nr.fly.blastx.out
           my.seq-blast/my.seq.nr.fly.blastx.out.OK
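
The order of operations in the trace (subtarget first, then the target, nothing rebuilt twice) can be sketched as a small recursive resolver. This is an illustrative dry run under assumptions, not biomake's code: the `PATTERNS` table, the `build` function, and the in-memory `done` set (standing in for the `.OK` marker files) are hypothetical, and a `-` argument is again read as empty. The sketch substitutes arguments literally, so it prints `formatdb -i nr.fly` rather than the `formatdb -i nr -p T` shown on the slide.

```python
# Hypothetical pattern table: parameters, required subtargets, command template.
PATTERNS = {
    "formatdb": {"params": ["D"], "req": [], "run": "formatdb -i {D}"},
    "blast": {
        "params": ["P", "S", "D", "A"],
        "req": [("formatdb", ["D"])],
        "run": "blastall -p {P} -i {S} -d {D} {A}",
    },
}


def build(name, args, done, log):
    """Build target name(args): requirements first, then the target itself."""
    target = (name, tuple(args))
    if target in done:              # already built; plays the role of a .OK marker
        return
    pattern = PATTERNS[name]
    env = dict(zip(pattern["params"], args))
    for req_name, req_params in pattern["req"]:
        build(req_name, [env[p] for p in req_params], done, log)
    shown = {k: ("" if v == "-" else v) for k, v in env.items()}
    log.append(pattern["run"].format(**shown).strip())   # dry run: record only
    done.add(target)


done, log = set(), []
build("blast", ["blastx", "my.seq", "nr.fly", "-"], done, log)
build("blast", ["blastp", "my.pep", "nr.fly", "-"], done, log)
print("\n".join(log))
# formatdb -i nr.fly
# blastall -p blastx -i my.seq -d nr.fly
# blastall -p blastp -i my.pep -d nr.fly
```

The second `blast` call reuses the already-built `formatdb(nr.fly)` subtarget, which is the point of tracking completed targets.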

Targets can be nested

    store(bop(my.seq, genscan(repeatmask(my.seq, drosophila))))

Target instantiations can be thought of as Skolem IDs.

    repeatmask(S,Org)
      run: repeatmask S -a Org
      flat: S.mask

    genscan(S)
      run: genscan S
      flat: S-pred/bn(S).genscan.out

    bop(S,B)
      run: apollo -bop -s S B -o target
      flat: B.game.xml

    store(XML)
      run: xml2db XML
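
One way to read the nested expression is that each inner target is resolved first and its flat ID is then passed to the enclosing pattern as an ordinary argument. The sketch below works through that reading for the `bop(...)` part of the example (`store` has no `flat:` tag on the slide, so it is left out). The substitution rule, the `FLAT` table, and the `flatten` function are assumptions for illustration; the slides do not spell out how biomake passes nested results.

```python
import os

# flat: templates copied from the patterns above; "bn_S" stands for bn(S).
FLAT = {
    "repeatmask": (("S", "Org"), "{S}.mask"),
    "genscan": (("S",), "{S}-pred/{bn_S}.genscan.out"),
    "bop": (("S", "B"), "{B}.game.xml"),
}


def flatten(target, order):
    """Return the flat ID of a nested target, recording inner targets first.

    A target is either a plain string (an existing file or value) or a tuple
    (pattern_name, arg1, arg2, ...) whose arguments may themselves be targets.
    """
    if isinstance(target, str):
        return target
    name, *args = target
    ids = [flatten(arg, order) for arg in args]    # resolve nested arguments first
    params, template = FLAT[name]
    env = dict(zip(params, ids))
    env.update({"bn_" + k: os.path.basename(v) for k, v in list(env.items())})
    flat_id = template.format(**env)
    order.append((name, flat_id))
    return flat_id


order = []
nested = ("bop", "my.seq", ("genscan", ("repeatmask", "my.seq", "drosophila")))
flatten(nested, order)
for name, flat_id in order:
    print(name, "->", flat_id)
# repeatmask -> my.seq.mask
# genscan -> my.seq.mask-pred/my.seq.mask.genscan.out
# bop -> my.seq.mask-pred/my.seq.mask.genscan.out.game.xml
```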

Iteration
- Pipelines frequently involve iterating over collections of data:
  - Perform a sequence analysis on every entry in a multi-fasta format file
  - Perform a peptide analysis on every gene prediction in some genscan output
  - Query a database for a list of IDs and perform some task on each
- biomake has language constructs for iteration

Iterating over datasets

    analyze_multifasta(F)
      iterate: analyze_seq(S)
        where S in splitfasta(F)

    splitfasta(F)
      run: splitfasta.pl -d F-split -md5 F
      flat: F-split/bn(F).pathlist
      comment: splitfasta.pl is part of the biomake distro

    analyze_seq(S)
      req: genie(S)
           blast(blastx,S,nr,-)

[Figure: a MultiFasta file is split into one file per sequence, and each sequence is analysed separately.]

    MultiFasta input:
      >seq1
      TAGGTATTGGTT
      AGGTGCGTCCTC
      >seq2
      GCGGTATAGCTT
      TTCCTTCTCTCT
      >seq3
      CAAAGCAGAGAT
      ATATTTATTCGC

    Outputs shown per sequence:
      seq1: seq1.genie.out, seq1.nr.blastx.out
      seq2: seq2.genie, seq2.nr.blastx.out
      seq3: seq3.genie, seq3.nr.blastx.out
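
The `iterate: ... where S in splitfasta(F)` construct amounts to splitting the input and requesting the per-sequence targets once for each piece. The sketch below mimics that expansion in Python using the three sequences from the slide. The function names are hypothetical, the splitting is done in memory rather than by `splitfasta.pl`, and the target strings are only printed, not executed.

```python
def split_fasta(text):
    """Split multi-FASTA text into {sequence_id: sequence}, one entry per '>' header."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a positional name if a header line is empty.
            current = header[0] if header else f"seq{len(records) + 1}"
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def analyze_multifasta(text):
    """Expand the iteration: the analyze_seq requirements for every sequence."""
    targets = []
    for seq_id in split_fasta(text):
        targets.append(f"genie({seq_id})")
        targets.append(f"blast(blastx,{seq_id},nr,-)")
    return targets


MULTIFASTA = """\
>seq1
TAGGTATTGGTT
AGGTGCGTCCTC
>seq2
GCGGTATAGCTT
TTCCTTCTCTCT
>seq3
CAAAGCAGAGAT
ATATTTATTCGC
"""

for target in analyze_multifasta(MULTIFASTA):
    print(target)
# genie(seq1)
# blast(blastx,seq1,nr,-)
# genie(seq2)
# blast(blastx,seq2,nr,-)
# genie(seq3)
# blast(blastx,seq3,nr,-)
```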

Controlling the runmode
- Tasks can be run locally or on a compute farm, synchronously or asynchronously
  - wrapper provided for PBS
- runmode: tag states the mode and wrapper for a particular target pattern
  - can be set globally and per-pattern
- special status targets provide execution status

runmode example

    blast(P,S,D,A)
      req: formatdb(D)
      run: blastall -p P -i S -d D A
      flat: S-blast/bn(S).D.P.out
      runmode: async(qsubwrap)

- The blast job will be executed on the compute farm via qsubwrap (comes with biomake distro)
- Upon submission, the status target status_run(blast(P,S,D,A)) will be generated; on completion, the target status_ok(blast(P,S,D,A)) will be generated
- biomake can automatically handle moving data in and out between user's filesystem (or db) and local cluster nodes

Datastores
- BioMake persists targets in Datastores
- The flat: tag flattens targets to unique datastore IDs
- Datastore can be filesystem or relational database
  - default is filesystem
  - can be set globally or per target
  - e.g. analysis result targets can be stored on filesystem, status targets stored in DB
- NFS traffic can be avoided on compute farm by storing targets and status targets in a database
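
The split between a filesystem datastore and a database datastore can be pictured as two classes behind the same `put`/`get` interface, both keyed by the flat ID. This is a sketch under assumptions: the class names, the table layout, and the use of SQLite are illustrative choices, not a description of biomake's storage layer.

```python
import os
import sqlite3
import tempfile


class FileStore:
    """Keeps each target as a file whose path is the flat ID under a root directory."""

    def __init__(self, root):
        self.root = root

    def put(self, flat_id, data):
        path = os.path.join(self.root, flat_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)   # flat IDs may contain "/"
        with open(path, "w") as handle:
            handle.write(data)

    def get(self, flat_id):
        try:
            with open(os.path.join(self.root, flat_id)) as handle:
                return handle.read()
        except FileNotFoundError:
            return None


class SqlStore:
    """Keeps each target as a row keyed by its flat ID."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS target (id TEXT PRIMARY KEY, data TEXT)")

    def put(self, flat_id, data):
        self.conn.execute("INSERT OR REPLACE INTO target VALUES (?, ?)", (flat_id, data))
        self.conn.commit()

    def get(self, flat_id):
        row = self.conn.execute(
            "SELECT data FROM target WHERE id = ?", (flat_id,)
        ).fetchone()
        return row[0] if row else None


# Analysis results on the filesystem, status targets in a database.
results = FileStore(tempfile.mkdtemp())
statuses = SqlStore(sqlite3.connect(":memory:"))

results.put("my.seq-blast/my.seq.nr.fly.blastx.out", "placeholder blast report")
statuses.put("status_ok(blast(blastx,my.seq,nr.fly,-))", "ok")

print(results.get("my.seq-blast/my.seq.nr.fly.blastx.out"))
print(statuses.get("status_ok(blast(blastx,my.seq,nr.fly,-))"))
```

Because both stores answer the same two calls, the choice between them can be made per target without the calling code changing.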

Asynchronous Execution

For each target T to be built (on the local machine running biomake):
1) biomake fetches status of T; skips T if status = ok/run
2) biomake stores status_run(T)
3) biomake creates a runner agent script and submits it to the cluster scheduler
4) continue onto next target

AGENT (on a cluster node):
1) agent fetches any input data
2) agent runs command (e.g. blast) synchronously
3) agent stores result
4) agent stores status of T as 'ok' or 'err'

[Figure: local machine running biomake, NFS, and cluster nodes each running an AGENT.]
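
The two numbered lists above can be simulated in a single process. In this sketch a thread pool stands in for the cluster scheduler and two dictionaries stand in for the status and result datastores; all names are hypothetical, and no real job submission (PBS, qsubwrap) is involved.

```python
from concurrent.futures import ThreadPoolExecutor

status = {}     # target -> "run" | "ok" | "err"  (stand-in for the status datastore)
results = {}    # target -> result                (stand-in for the result datastore)


def agent(target, command):
    """Agent side: run the command synchronously, store the result, store the status."""
    try:
        results[target] = command()
        status[target] = "ok"
    except Exception:
        status[target] = "err"


def submit_all(jobs, pool):
    """Submitter side. jobs is a list of (target, callable) pairs."""
    submitted = []
    for target, command in jobs:
        if status.get(target) in ("ok", "run"):
            continue                          # step 1: skip finished or running targets
        status[target] = "run"                # step 2: store status_run(T)
        pool.submit(agent, target, command)   # step 3: hand the agent to the scheduler
        submitted.append(target)              # step 4: continue without waiting
    return submitted


status["formatdb(nr.fly)"] = "ok"             # pretend this target was built earlier
jobs = [
    ("formatdb(nr.fly)", lambda: "indexed"),
    ("blast(blastx,seq1,nr,-)", lambda: "hits for seq1"),
    ("blast(blastx,seq2,nr,-)", lambda: 1 / 0),    # a job that fails
]

with ThreadPoolExecutor(max_workers=2) as pool:
    submitted = submit_all(jobs, pool)
# Leaving the with-block waits for all agents to finish.

print(submitted)   # ['blast(blastx,seq1,nr,-)', 'blast(blastx,seq2,nr,-)']
print(status)
```

Running `submit_all` a second time would resubmit only the failed target, since targets with status `ok` or `run` are skipped.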
