

slide-1
SLIDE 1

TRAPPIST:

A toolkit for comparative analysis and visualization of genomic regions

Geraldine A. Van der Auwera, PhD

https://github.com/gglobster/trappist

slide-2
SLIDE 2

TRAPPIST:

A toolkit for comparative analysis and visualization of genomic regions

Geraldine A. Van der Auwera, PhD Shankar Ambady

https://github.com/gglobster/trappist

full-featured application
slide-3
SLIDE 3

The source code of Life

• Evolution = 4 bn years of forking without version tracking

… and you thought legacy Fortran code was a pain

slide-4
SLIDE 4

Two distinct issues

• Getting the code from the repository (living beings): extraction, sequencing, assembly

• Reverse-engineering the code (zero documentation!): experimentation, mutagenesis + (comparative) sequence analysis

slide-5
SLIDE 5

Top issue for “getting”

Lincoln Stein via C. Titus Brown @PyCon 2011

NGS is outscaling Moore’s Law

slide-6
SLIDE 6

Top issue for “getting”

Lincoln Stein via C. Titus Brown @PyCon 2011

NGS is outscaling Moore’s Law

WHATEVER

slide-7
SLIDE 7

Evolving process of rev-eng

• No genomes → entirely experimental
  • Make random mutants, trace back effect to gene of interest

• One genome → some predictive filtering
  • Design mutants, long iterative process

• Many related genomes → much better predictive filtering
  • Nature’s mutants, drastically reduced iterative process

slide-8
SLIDE 8

Nature’s mutants (example)

[Figure. Sequence labels: pXO1, pBCXO1, p03BB102_179, pAH820_272, pAH187_270, NZ_ACMR0, NZ_ACMH0, IS075, pBc10987, NZ_ACMC0, NZ_ACMS0, NZ_ACMT0, NZ_ACNI0, VD022, Schrouff, NZ_ACMO0, TIAC129, NZ_ACNJ0, NZ_ABDM0, NZ_ACNB0, NZ_ACLY0, NZ_ACMP0, NZ_ACNK0, pBc239, NZ_ACNE0, NZ_ABDA0, NZ_ACLV0, NZ_ACLT0, NZ_ACNF0, NZ_ACNA0. Region labels: repX, rep2, tra1, tra2, tra3, Anthrax PAI.]

slide-9
SLIDE 9

Typical analysis process

All done through separate GUIs → poor batching, no automation, no chaining

[Diagram label: BLAST]

slide-10
SLIDE 10

Programmatic access

• The servers can be accessed with scripts*, and there are awesome libraries that provide wrappers, data structures etc.

But here’s the rub…

* (there are a few GUI pipeline apps but usability is an issue </diplomatic>)
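
As an illustration of scripted server access, here is a minimal sketch that builds a request URL for the NCBI E-utilities `efetch` service using only the Python standard library. The slides do not say which servers or libraries are meant, so the choice of NCBI, the helper names `build_efetch_url` and `fetch_sequences`, and the placeholder accession are all assumptions made for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch endpoint (an assumed example server).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Return the efetch URL for one accession (hypothetical helper)."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH_URL}?{query}"


def fetch_sequences(accessions):
    """Download one record per accession. This performs network requests."""
    records = {}
    for accession in accessions:
        with urlopen(build_efetch_url(accession)) as response:
            records[accession] = response.read().decode("utf-8")
    return records


# Usage: fetch_sequences(["EXAMPLE_ACCESSION"]) would batch-download records,
# where "EXAMPLE_ACCESSION" stands in for a real accession number.
```

Looping over a list of accessions is the kind of batching that separate GUIs make awkward.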

slide-11
SLIDE 11

“What’s a command line?”

Exhibit A: Experimental Biologist

slide-12
SLIDE 12

TRAPPIST

Totally Rad Analysis Pipelines Python Super Tool

slide-15
SLIDE 15

Example pipeline / workflow (my research)

[Workflow diagram. Stages: Data Input, Core Analysis, Optional Analyses.
Component labels: CONTIG FISHER, CONSERVATION SORTER, PHYLOGENY CONSTRUCT, SEQUENCE VARIATION, PHYLOGENY CONGRUENCE, STRUCTURE COMPARATOR, HOST HK_SET EXTRACTOR, CONTENT FUNCTIONS.
Data labels: list of genomes, genome DBs, reference, list of targets, plasmids OR list of genes, segment sets, plasmid trees, H_K sets, host trees, reference + constructs.]

slide-16
SLIDE 16

Do it manually?

Exhibit B: Lazy Postdoc (me)

slide-17
SLIDE 17

Long story short

Collection of one-time scripts → Toolkit → Full-featured application → DNA-based OS?

slide-18
SLIDE 18

Fundamental requirements

• Design, assembly and modification of pipelines / workflows

• Automated execution, parameter / output versioning, provenance data bundling, interactive visualization

slide-19
SLIDE 19

Staging / Execution

slide-20
SLIDE 20

Staging system

• Basic requirements
  • intuitive
  • flexible
  • extensible

+ Pre-assembled workflows / pipelines
+ Hooks for external / roll-your-own functions

slide-21
SLIDE 21

Staging area - workflows

slide-22
SLIDE 22

Workflow components

• TRAPPIST provides discrete task components for every step of analysis:
  • Initial inputs selection
  • Data processing steps (existing algorithms)
  • Graphical output

• Component I/O relies on matching ports with data object classes
  • Forces validation of data type/format (not up to user)
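
One way to picture port matching is the sketch below, where each port declares the data object class it carries and a connection is refused when the classes do not match. This is not TRAPPIST's actual code: the class names `DataObject`, `GenomeList`, `Phylogeny`, `Port` and the function `connect` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


class DataObject:
    """Base class for typed data passed between components (hypothetical)."""


class GenomeList(DataObject):
    """A list of genomes (hypothetical data object class)."""


class Phylogeny(DataObject):
    """A phylogenetic tree (hypothetical data object class)."""


@dataclass
class Port:
    """An input or output slot that only accepts one data object class."""
    name: str
    data_class: type


def connect(source: Port, target: Port):
    """Link an output port to an input port, validating the data class."""
    if not issubclass(source.data_class, target.data_class):
        raise TypeError(
            f"cannot connect {source.name} ({source.data_class.__name__}) "
            f"to {target.name} ({target.data_class.__name__})"
        )
    return (source, target)


# A matching connection succeeds; a mismatched one raises TypeError.
genomes_out = Port("selected genomes", GenomeList)
genomes_in = Port("genomes to search", GenomeList)
tree_in = Port("tree to draw", Phylogeny)
link = connect(genomes_out, genomes_in)
```

Because the check happens when ports are connected, a user cannot wire a genome list into a port that expects a tree.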

slide-23
SLIDE 23

Component representation

slide-24
SLIDE 24

Interacting with components

slide-25
SLIDE 25

Connecting components

slide-28
SLIDE 28

Execution system

• Progressive / modular / dependency-aware

• Parameter set versioning linked to output versioning
  → Users more likely to try various parameters to test assumptions
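
A minimal sketch of linking parameter versions to output versions, assuming the link is made by hashing the parameter set and using the hash as the output label. The slides do not describe the actual mechanism, so the hashing approach and the names `parameter_version` and `output_label` are assumptions for illustration.

```python
import hashlib
import json


def parameter_version(params):
    """Return a short, stable identifier for a parameter set."""
    # Sorting keys makes the identifier independent of dictionary order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def output_label(component, params):
    """Name an output after the component and its parameter version."""
    return f"{component}/{parameter_version(params)}"


# Two runs with different parameters get different output labels,
# so neither result overwrites the other.
strict = output_label("blast", {"evalue": 1e-10, "identity": 90})
relaxed = output_label("blast", {"evalue": 1e-5, "identity": 70})
```

With outputs keyed this way, trying another parameter set costs nothing in bookkeeping, which is the behavior the slide hopes to encourage.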

slide-29
SLIDE 29

Flow control rule

• Component ports have “fill” status

• If all its inputs are filled, component is OK to execute
  → add component to execution queue

I’m OK to go!
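
The rule above can be sketched as a small queue-driven scheduler. This is an illustrative assumption, not TRAPPIST's implementation: the `Component` class, its `listeners` list and the `run` function are hypothetical, and an input counts as unfilled while it is `None`.

```python
from collections import deque


class Component:
    """A task with named input ports; None means the port is unfilled."""

    def __init__(self, name, input_names, func):
        self.name = name
        self.inputs = {port: None for port in input_names}
        self.func = func
        self.listeners = []  # (downstream component, input port name) pairs

    def ready(self):
        """A component is OK to execute when all its inputs are filled."""
        return all(value is not None for value in self.inputs.values())


def run(components):
    """Execute components as their inputs fill; return the execution order."""
    queue = deque(c for c in components if c.ready())
    order = []
    while queue:
        component = queue.popleft()
        result = component.func(**component.inputs)
        order.append(component.name)
        # Fill downstream ports, queueing any component that becomes ready.
        for target, port in component.listeners:
            target.inputs[port] = result
            if target.ready():
                queue.append(target)
    return order


load = Component("load", [], lambda: [3, 1, 2])
sort = Component("sort", ["data"], lambda data: sorted(data))
report = Component("report", ["data"], lambda data: len(data))
load.listeners.append((sort, "data"))
sort.listeners.append((report, "data"))
execution_order = run([report, sort, load])
```

The order in which components are listed does not matter: each one runs only once its upstream results have filled its ports.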

slide-30
SLIDE 30

Database architecture

[Diagram: a Central DB, plus a Staging DB and an Execution DB for each workflow (Workflow 1, Workflow 2)]

slide-31
SLIDE 31
slide-32
SLIDE 32

Staging DB

• DB schema + data dump sufficient to fully describe a workflow
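
To show how a schema plus a data dump can describe a workflow completely, here is a sketch using SQLite from the Python standard library. The slides do not name the database engine or the schema, so SQLite, the `component` and `link` tables, and the helper names are all assumptions for this example.

```python
import sqlite3

# Hypothetical staging schema: components and the links between them.
SCHEMA = """
CREATE TABLE component (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE link (
    source_id INTEGER NOT NULL REFERENCES component(id),
    target_id INTEGER NOT NULL REFERENCES component(id)
);
"""


def build_example():
    """Create an in-memory staging DB holding a two-component workflow."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO component (id, name) VALUES (?, ?)",
        [(1, "inputs"), (2, "analysis")],
    )
    conn.execute("INSERT INTO link (source_id, target_id) VALUES (1, 2)")
    conn.commit()
    return conn


def dump_workflow(conn):
    """Return the schema and data as one SQL text dump."""
    return "\n".join(conn.iterdump())


def restore_workflow(sql_text):
    """Rebuild a workflow database from a dump produced by dump_workflow."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(sql_text)
    return conn


restored = restore_workflow(dump_workflow(build_example()))
```

The dump is plain text, so a workflow could be shared or archived as a single file and rebuilt elsewhere.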

slide-33
SLIDE 33

Execution DB

slide-34
SLIDE 34

Enforcing good practices

• Provenance bundle including:
  • Workflow schema
  • Parameter sets
  • Code version info

→ Executable papers! → Reproducibility! → Science!
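
A provenance bundle with the three items listed above could be as simple as one JSON document. The sketch below assumes that layout; the field names, the function names and the example values are hypothetical, since the slides do not specify a bundle format.

```python
import json


def build_provenance_bundle(workflow_schema, parameter_sets, code_version):
    """Collect the three provenance items into one dictionary."""
    return {
        "workflow_schema": workflow_schema,
        "parameter_sets": parameter_sets,
        "code_version": code_version,
    }


def serialize_bundle(bundle):
    """Render the bundle as JSON text with a stable key order."""
    return json.dumps(bundle, indent=2, sort_keys=True)


# Example values are placeholders, not real TRAPPIST data.
bundle = build_provenance_bundle(
    workflow_schema="CREATE TABLE component (id INTEGER PRIMARY KEY);",
    parameter_sets={"blast": {"evalue": 1e-10}},
    code_version="example-version",
)
bundle_text = serialize_bundle(bundle)
```

Shipping this text alongside the results gives a reader what they need to rerun the same workflow with the same settings.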

slide-35
SLIDE 35

Interactive visualization

slide-36
SLIDE 36

U haz questions?