TRAPPIST:
A toolkit for comparative analysis and visualization of genomic regions
Geraldine A. Van der Auwera, PhD
https://github.com/gglobster/trappist
Evolution = 4 bn years
… and you thought legacy Fortran code was a pain
Getting the code from the repository (living beings): extraction, sequencing, assembly
Reverse-engineering the code (zero documentation!): experimentation, mutagenesis + (comparative) sequence analysis
Lincoln Stein via C. Titus Brown @PyCon 2011
No genomes: entirely experimental
  Make random mutants, trace back effect to gene of interest
One genome: some predictive filtering
  Design mutants, long iterative process
Many related genomes: much better predictive filtering
  Nature’s mutants, drastically reduced iterative process
[Figure: pXO1, pBCXO1 and related sequences; labels: repX, rep2, tra1, tra2, tra3, Anthrax PAI]
Sequences shown: p03BB102_179, pAH820_272, pAH187_270, NZ_ACMR0, NZ_ACMH0, IS075, pBc10987, NZ_ACMC0, NZ_ACMS0, NZ_ACMT0, NZ_ACNI0, VD022, Schrouff, NZ_ACMO0, TIAC129, NZ_ACNJ0, NZ_ABDM0, NZ_ACNB0, NZ_ACLY0, NZ_ACMP0, NZ_ACNK0, pBc239, NZ_ACNE0, NZ_ABDA0, NZ_ACLV0, NZ_ACLT0, NZ_ACNF0, NZ_ACNA0
All done through separate GUIs (e.g. BLAST): poor batching, no automation, no chaining
The servers can be accessed with scripts*, and there are
awesome libraries that provide wrappers, data structures etc. But here’s the rub…
* (there are a few GUI pipeline apps but usability is an issue </diplomatic>)
Exhibit A: Experimental Biologist
Totally Rad Analysis Pipelines Python Super Tool
[Diagram: pipeline overview in three stages: Data Input, Core Analysis, Optional Analyses]
Component labels: CONTIG FISHER, CONSERVATION SORTER, PHYLOGENY CONSTRUCT, SEQUENCE VARIATION, PHYLOGENY CONGRUENCE, STRUCTURE COMPARATOR, HOST HK_SET EXTRACTOR, CONTENT FUNCTIONS
Data labels: list of genomes, genome DBs, reference, list of targets OR list of genes, plasmids, segment sets, plasmid trees, H_K sets, host trees, reference + constructs
Exhibit B: Lazy Postdoc (me)
Collection of one-time scripts → Toolkit → Full-featured application → DNA-based OS?
Design, assembly and modification of pipelines /
Automated execution, parameter / output versioning,
Basic requirements
intuitive, flexible, extensible
+ Pre-assembled workflows / pipelines
+ Hooks for external / roll-your-own functions
TRAPPIST provides discrete task components for every step of analysis:
  Initial inputs selection
  Data processing steps (existing algorithms)
  Graphical output
Component I/O relies on matching ports with data object classes
Forces validation of data type/format (not up to user)
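
To make the port idea concrete, here is a minimal sketch of a typed port that only accepts one data object class. All names in it (`DataObject`, `GenomeList`, `TreeSet`, `Port`) are hypothetical illustrations, not TRAPPIST's actual classes.

```python
class DataObject:
    """Hypothetical base class for anything passed between components."""


class GenomeList(DataObject):
    def __init__(self, names):
        self.names = list(names)


class TreeSet(DataObject):
    def __init__(self, newick_strings):
        self.trees = list(newick_strings)


class Port:
    """A typed slot on a component; it accepts exactly one data object class."""

    def __init__(self, name, accepts):
        self.name = name
        self.accepts = accepts
        self.value = None

    def fill(self, obj):
        # Validation happens here, so it is not left up to the user.
        if not isinstance(obj, self.accepts):
            raise TypeError(
                f"port '{self.name}' expects {self.accepts.__name__}, "
                f"got {type(obj).__name__}"
            )
        self.value = obj


genomes_in = Port("genomes", accepts=GenomeList)
genomes_in.fill(GenomeList(["pXO1", "pBCXO1"]))  # matching class: accepted

try:
    genomes_in.fill(TreeSet(["(a,b);"]))  # wrong class: rejected
    rejected = False
except TypeError:
    rejected = True
```

Because the check lives in the port, connecting two components with mismatched data classes fails at wiring time rather than halfway through an analysis.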
Progressive / modular / dependency-aware
Parameter set versioning linked to output versioning
Users more likely to try various parameters to test assumptions
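
One possible way to link parameter versions to output versions is to derive the output location from a hash of the parameter set. The sketch below assumes that approach; the function names and the parameters `min_identity` and `min_length` are made up for illustration and are not taken from TRAPPIST.

```python
import hashlib
import json


def param_version(params):
    """Return a short, stable identifier for a parameter set."""
    # Sorting keys makes the identifier independent of dictionary order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def output_dir(component, params):
    """Build an output path that embeds the parameter set version."""
    return f"{component}/{param_version(params)}"


run_a = {"min_identity": 0.40, "min_length": 500}
run_b = {"min_length": 500, "min_identity": 0.40}  # same set, other key order
run_c = {"min_identity": 0.50, "min_length": 500}  # one value changed

same = output_dir("sorter", run_a) == output_dir("sorter", run_b)
different = output_dir("sorter", run_a) != output_dir("sorter", run_c)
```

With this scheme, rerunning a step with new parameters never overwrites earlier results, which is what makes trying several settings cheap.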
Component ports have “fill” status
If all its inputs are filled, component is OK to execute
add component to execution queue
I’m OK to go!
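
The fill-status rule above can be sketched as follows. This is an illustrative toy, not the TRAPPIST implementation: `Component`, `Workflow` and the two example steps are hypothetical, and a port counts as filled as soon as it holds any value other than `None`.

```python
from collections import deque


class Component:
    def __init__(self, name, input_ports, run):
        self.name = name
        self.inputs = {port: None for port in input_ports}  # None = unfilled
        self.run = run
        self.queued = False

    def ready(self):
        """A component is OK to execute once every input port is filled."""
        return all(value is not None for value in self.inputs.values())


class Workflow:
    def __init__(self):
        self.queue = deque()
        self.links = {}    # source name -> list of (target component, port)
        self.results = {}  # component name -> output
        self.log = []      # execution order

    def connect(self, source, target, port):
        self.links.setdefault(source.name, []).append((target, port))

    def fill(self, component, port, value):
        component.inputs[port] = value
        if component.ready() and not component.queued:
            component.queued = True
            self.queue.append(component)  # "I'm OK to go!"

    def execute(self):
        while self.queue:
            comp = self.queue.popleft()
            result = comp.run(**comp.inputs)
            self.results[comp.name] = result
            self.log.append(comp.name)
            # Pass the output on; this may make downstream components ready.
            for target, port in self.links.get(comp.name, []):
                self.fill(target, port, result)


select = Component("select", ["names"], lambda names: sorted(names))
count = Component("count", ["items", "label"],
                  lambda items, label: f"{label}: {len(items)}")

wf = Workflow()
wf.connect(select, count, "items")
wf.fill(count, "label", "genomes")            # "items" still empty: not queued
wf.fill(select, "names", ["pXO1", "pBCXO1"])  # all inputs filled: queued
wf.execute()
```

Here `count` only enters the queue after `select` has run and filled its `items` port, so the execution order follows the dependencies without being declared separately.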
[Diagram: a Central DB, with a Staging DB and an Execution DB for each workflow (Workflow 1, Workflow 2)]
DB schema + data dump sufficient to fully describe a workflow
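
As an illustration of "schema + data dump describes the workflow", the sketch below uses SQLite from the Python standard library. The choice of SQLite and the two-table schema (`component`, `link`) are assumptions made for the example; the slides do not say which database or schema TRAPPIST uses.

```python
import sqlite3

# Hypothetical workflow database: components and the links between them.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE component (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE link (source INTEGER REFERENCES component(id),
                       target INTEGER REFERENCES component(id));
    INSERT INTO component (id, name) VALUES (1, 'select'), (2, 'compare');
    INSERT INTO link (source, target) VALUES (1, 2);
""")
src.commit()

# iterdump() yields the SQL statements (schema + data) that rebuild the DB.
dump = "\n".join(src.iterdump())

# Rebuilding from the dump alone recovers the full workflow description.
copy = sqlite3.connect(":memory:")
copy.executescript(dump)
rows = copy.execute(
    "SELECT s.name, t.name FROM link "
    "JOIN component s ON s.id = link.source "
    "JOIN component t ON t.id = link.target"
).fetchall()
```

The dump is plain text, so it can be stored or shared alongside the results it describes.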
Provenance bundle including:
  Workflow schema
  Parameter sets
  Code version info
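
A provenance bundle with those three parts could be serialized as a single JSON document. The sketch below assumes that format; the key names and all values (component names, `min_identity`, the version string) are placeholders invented for the example.

```python
import json


def build_provenance(workflow_schema, parameter_sets, code_version):
    """Collect the three parts of the bundle into one serializable dict."""
    return {
        "workflow_schema": workflow_schema,
        "parameter_sets": parameter_sets,
        "code_version": code_version,
    }


bundle = build_provenance(
    workflow_schema={
        "components": ["select", "compare"],
        "links": [["select", "compare"]],
    },
    parameter_sets={"compare": {"min_identity": 0.4}},
    code_version={"toolkit": "0.0.0-example"},  # placeholder version string
)

# Sorted keys give a stable text form that diffs cleanly between runs.
serialized = json.dumps(bundle, sort_keys=True, indent=2)
restored = json.loads(serialized)
```

Keeping the bundle next to each output makes it possible to tell which workflow, parameters and code produced a given result.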