SLIDE 1

darwin: a Scalable Version Control System for Genomic Data

Danny McClanahan, Vanderbilt University Software

SLIDE 2

Abstract

Synthetic biologists create genomes by editing DNA text directly.

Changes made are difficult to track, which leads to security problems.

No existing change-tracking software works with genome-scale data.

darwin is a software package to document and track collaborative changes to DNA on the genome scale.

SLIDE 3

Basic Biology Review

An ORF (open reading frame) codes for a protein
ORFs are therefore the interesting parts of a gene
A gene can have multiple ORFs
ORFs are translated by ribosomes in the cell
Each ORF has special start and end markers
The ribosome uses these to determine where to begin and end translation into proteins
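
The start and end markers can be illustrated with a minimal sketch. It assumes the standard start codon ATG and stop codons TAA, TAG, and TGA, and it scans only the forward strand, codon by codon. The function name `find_orfs` is hypothetical; the slides do not say which rule darwin itself uses to delimit ORFs.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end) index pairs of non-overlapping ORFs.

    Assumption: an ORF runs from an ATG to the first in-frame stop codon
    on the forward strand. `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    i = 0
    while i + 3 <= len(seq):
        if seq[i:i + 3] != START:
            i += 1
            continue
        end = None
        # Walk forward one codon (three bases) at a time from the start codon.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOPS:
                end = j + 3
                break
        if end is None:
            i += 1  # no in-frame stop codon; keep scanning
        else:
            orfs.append((i, end))
            i = end  # resume after this ORF
    return orfs


spans = find_orfs("CCATGAAATGACC")
```

Here `spans` holds a single pair covering `ATGAAATGA`, with the surrounding `CC` on each side left as non-ORF sequence.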

SLIDE 4

What is Version Control?

Record every change made to a file or set of files
When, What, Who
Merge changes by multiple collaborators
Ensures every member of team has updated copy
Typical tool used is called git

SLIDE 5

How Git Processes Files

git is a line-based system
Only records lines added and deleted
The more lines in a file, the longer it takes git to process.

This makes it inefficient for processing DNA files

previous:
AAAAAAAAA
BBBBBBBBB
CCCCCCCC

current:
AAAAAAAAA
CCCCCCCC
DDDDDDDD

changes recorded:
-BBBBBBBBB
+DDDDDDDD
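
The previous/current example can be reproduced with a small line-based diff. This sketch uses Python's standard `difflib` module as a stand-in; it is not git's own diff algorithm, and the helper name `recorded_changes` is hypothetical.

```python
import difflib

previous = ["AAAAAAAAA", "BBBBBBBBB", "CCCCCCCC"]
current = ["AAAAAAAAA", "CCCCCCCC", "DDDDDDDD"]


def recorded_changes(old_lines, new_lines):
    """Keep only the removed ('-') and added ('+') lines of a line diff."""
    changes = []
    for entry in difflib.ndiff(old_lines, new_lines):
        # ndiff prefixes each line with a two-character code such as "- " or "+ ".
        if entry[:2] in ("- ", "+ "):
            changes.append(entry[0] + entry[2:])
    return changes


changes = recorded_changes(previous, current)
```

Only whole lines appear in `changes`, which is why one very long line that differs by a few characters is recorded in full.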

SLIDE 6

What darwin Does

darwin preprocesses DNA files before putting them through git
Create temporary file which is optimized so git performs fewer operations and runs faster
Put temporary file in git
Reconstruct original file from temporary file
Makes version control of genomic data feasible by increasing the speed at which git processes data.

SLIDE 7

Approach Part 1: Split by ORF

FASTA/GenBank/ApE/etc typically formatted in fixed-length lines

e.g.:
FASTA (typically 50 or 70 characters per line):

CATACAATCCAGGTTTTAATCATCAGAAATCACAGTCCTATTGTCTTCTGCACAGACCCAAACACACTTG
GAGGTCATGTTCAATATGAATACCTCACAGAGAAGGAAATTTACACGCGAGAAGTACATCTGCAGAAAGC
CAGCTGGCATGTCAACCATTCAAAAACTCAGGGTGTTCTGGATAAAGAAGACTCAGGAAGACAAGTATGA
AGCATAATCTGTGACATTCCATGCGGCAGACATTAGACACATACAAGAGAGTTGTTGGAAAGCGGAATTT
ATCTTCATATAAACAACACTGAGCTAAATCTCAATATTTCAGATCTCTAGAACTATCCATCAGTGAAATG

ApE (typically 76 characters per line):

  1 TCGCGCGTTT CGGTGATGAC GGTGAAAACC TCTGACACAT GCAGCTCCCG GAGACGGTCA
 61 CAGCTTGTCT GTAAGCGGAT GCCGGGAGCA GACAAGCCCG TCAGGGCGCG TCAGCGGGTG
121 TTGGCGGGTG TCGGGGCTGG CTTAACTATG CGGCATCAGA GCAGATTGTA CTGAGAGTGC
181 ACCATATGCG GTGTGAAATA CCGCACAGAT GCGTAAGGAG AAAATACCGC ATCAGGCGCC

SLIDE 8

Approach Part 1: Split by ORF

Remove formatting of FASTA/ApE/GenBank/etc
Split file into lines by ORF
Changes to single ORF now only affect single line
Temporary file produced is now much smaller
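
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not darwin's code: the input is assumed to contain only the sequence block (plus an optional FASTA `>` header), and the ORF spans are supplied by the caller as `(start, end)` index pairs. The names `strip_formatting` and `split_at_spans` are hypothetical.

```python
def strip_formatting(text):
    """Drop FASTA header lines, position numbers, and whitespace."""
    bases = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue  # FASTA description line, not sequence data
        # Keep letters only; this removes digits, spaces, and tabs.
        bases.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(bases).upper()


def split_at_spans(seq, spans):
    """Split seq into lines that alternate between non-ORF and ORF text.

    Assumption: spans are sorted, non-overlapping (start, end) pairs with
    an exclusive end, produced by some ORF finder.
    """
    lines = []
    pos = 0
    for start, end in spans:
        if start > pos:
            lines.append(seq[pos:start])  # non-ORF stretch before this ORF
        lines.append(seq[start:end])      # the ORF itself, on its own line
        pos = end
    if pos < len(seq):
        lines.append(seq[pos:])           # trailing non-ORF stretch
    return lines


fasta_text = ">demo\nCCATGAAA\nTGACC\n"
sequence = strip_formatting(fasta_text)
temp_lines = split_at_spans(sequence, [(2, 11)])
```

Joining `temp_lines` back together gives the unformatted sequence again, so the split itself loses no bases. Restoring the original file's line length and numbering would need extra bookkeeping that this sketch leaves out.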

SLIDE 9

Approach Part 1: Split by ORF

Output files now look like this
Note that lines are now varying length, and alternating between ORF and non-ORF
Adding or modifying an ORF now only changes a single line of output

CATACAATCCAGGTTTTAATCATCAGAAATCACAGTCCTATTGTCTTCTGCACAGACCCAAACACACTTGGAGGTC
ATGTTCAATATGAATACCTCACAGAGAAGGAAATTTACACGCGAGAAGTACATCTGCAGAAAGCCAGCTGGCATGTCAACCATTCAAAAACTCAGGGTGTTCTGGATAA
AGAAGACTCAGGAAGACAAGT
ATGAAGCATAATCTGTGACATTCCATGCGGCAGACATTAGACACATACAAGAGAGTTGTTGGAAAGCGGAATTTATCTTCATATAA
ACAACACTGAGCTAAATCTCAATATTTCAGATCTCTAGAACTATCCATCAGTGAA
ATG

SLIDE 10

Approach Part 2: Edits within ORF

Consider adding a few amino acids at the beginning of a long ORF:

Before: ATGAGAGGCGGTTGC...
After: ATGAAAAGCATAAGAGGCGGTTGC...
Since git only sees changes in lines, it counts the same as adding and removing an entire ORF
This could be thousands of characters changed for a single small insertion

SLIDE 11

Approach Part 2: Edits within ORF

previous:
ATGAGAG…

current:
ATGAAAAGCATAAGAG…

changes recorded:
-ATGAGAG…
+ATGAAAAGCATAAGAG…

SLIDE 12

Approach Part 2: Edits within ORF

Identify ORFs that have only small edits between two versions of file
Find only those small changes that were made and record those
Actual ORF can be reconstructed from previous ORF + changes

SLIDE 13

Approach Part 2: Edits within ORF

Previous example:
Before: ATGAGAGGCGGTTGC...
After: ATGAAAAGCATAAGAGGCGGTTGC...

This turns into:
ATGAGAGGCGGTTGCA...
+AAAAGCATA@3
Short line of edits added, not entire long ORF

SLIDE 14

Approach Part 3: Use of Concurrency

Water bucket analogy
File I/O (input-output) is extremely slow
darwin has to do both input and output
So use concurrency to continue to do work while waiting for slow file operations

SLIDE 15

Approach Part 3: Use of Concurrency

Create queues of “buckets” of input and output
First bucket passed from file reader to processor
File reader continues reading while processor completes

Finally, bucket passed from processor to writer
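
The reader, processor, and writer stages can be sketched with Python's standard `queue` and `threading` modules. The slides do not say which language or primitives darwin uses, so this only illustrates the bucket-passing idea; the names `run_pipeline`, `DONE`, and `max_buckets` are hypothetical, and an in-memory list stands in for the input and output files.

```python
import queue
import threading

DONE = object()  # sentinel that tells the next stage there are no more buckets


def reader(chunks, out_q):
    for chunk in chunks:
        out_q.put(chunk)  # blocks only when the queue is full
    out_q.put(DONE)


def processor(in_q, out_q, transform):
    while True:
        bucket = in_q.get()
        if bucket is DONE:
            out_q.put(DONE)
            return
        out_q.put(transform(bucket))


def writer(in_q, sink):
    while True:
        bucket = in_q.get()
        if bucket is DONE:
            return
        sink.append(bucket)


def run_pipeline(chunks, transform, max_buckets=4):
    """Run reader, processor, and writer concurrently; return written buckets."""
    to_process = queue.Queue(maxsize=max_buckets)
    to_write = queue.Queue(maxsize=max_buckets)
    sink = []
    threads = [
        threading.Thread(target=reader, args=(chunks, to_process)),
        threading.Thread(target=processor, args=(to_process, to_write, transform)),
        threading.Thread(target=writer, args=(to_write, sink)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sink


written = run_pipeline(["acgt", "ttga", "ccat"], str.upper)
```

Each stage has one thread and the queues are first-in, first-out, so buckets reach the writer in the order they were read. The bounded queues keep a fast reader from filling memory while a slower stage catches up.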

SLIDE 16

Approach Part 3: Use of Concurrency

Perform four cycles side-by-side in same time as two cycles without concurrency

Massive pipelined speedup available

SLIDE 17

Results

Tested on multiple iterations of ApE files from Vanderbilt wetware team

darwin made processing files with git about twice as fast

Speedup

SLIDE 18

Results

Data about experimental setup
40,000 trials run on four successive iterations of a real-life DNA file
“wall-clock time” used to measure time actually visible to the user
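
Wall-clock measurement over repeated trials can be sketched as below. This is not the harness used for the experiment, whose details the slides do not give. The helper `time_trials` is hypothetical, and the workload is a placeholder for the step being measured.

```python
import time


def time_trials(func, trials):
    """Run func `trials` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()  # monotonic clock suited to elapsed time
        func()
        durations.append(time.perf_counter() - start)
    return durations


def placeholder_workload():
    # Stand-in for the operation under test.
    sum(range(10_000))


durations = time_trials(placeholder_workload, trials=100)
mean_seconds = sum(durations) / len(durations)
```

Wall-clock time includes time spent waiting on file I/O, which CPU-time measurements would leave out.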

Why do results matter?
This experiment shows that even a draft copy of the software can make processing files with git about twice as fast.

SLIDE 19

Future Work

More filetypes:
2bit, SAM/BAM, etc
GUI
Further optimization

SLIDE 20

Project Summary

darwin is a software package to document changes to DNA.

Allows for easy, standardized, and collaborative editing on DNA data up to the genome scale.

Builds off of tested and proven version control software.

Uses algorithms to preprocess DNA files and log changes twice as fast as the current method.

SLIDE 21

Acknowledgements

Mitchell Gordon, for software development.
Jules White, for advice and help.
VUSE, and specifically the EECE department,

for their support throughout this project.