A DNA-Based Archival Storage System
James Bornholt* Randolph Lopez* Douglas M. Carmean† Luis Ceze* Georg Seelig* Karin Strauss†
* University of Washington
† Microsoft Research
Facebook cold storage facility: 1 exabyte (10^9 GB), 66,000 square feet
Extremely dense
Theory: 1 exabyte in 1 in^3
Extremely durable
Half-life > 500 years
Store
Write Read
Redundancy and density
Efficient retrieval
Wet lab experiments
Four nucleotides:
A C G T
Adenine Cytosine Guanine Thymine
A DNA strand (oligonucleotide) is a linear sequence of these nucleotides:
G A C A C C T
Two strands can bind to each other if they are complementary:
C T G T G G A
G A C A C C T
C, G are complementary
A, T are complementary
Partial errors allowed
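The pairing rule above can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `mismatches` are made up here, and the strands are compared position by position as drawn on the slide. Real strands pair antiparallel, so sequence tools usually work with the reverse complement instead.

```python
# Pairing rule from the slide: C pairs with G, A pairs with T.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the position-by-position complement of a strand."""
    return "".join(COMPLEMENT[base] for base in strand)

def mismatches(strand: str, other: str) -> int:
    """Count positions where `other` fails to pair with `strand`.

    Illustrates "partial errors allowed": two strands can still bind
    when a few positions do not pair (hypothetical helper).
    """
    if len(strand) != len(other):
        raise ValueError("strands must have the same length")
    return sum(1 for a, b in zip(strand, other) if COMPLEMENT[a] != b)

print(complement("GACACCT"))
print(mismatches("GACACCT", "CTGTGGT"))
```

The first call reproduces the slide's pair (G A C A C C T binds C T G T G G A); the second counts one unpaired position in a nearly complementary strand.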
Synthesis: manufacturing DNA strands GACACCT
G A C A C C T
Sequencing: reading DNA strands GACACCT
G A C A C C T
Archival storage system structured as a key-value store
put(key, value)
get(key)
[Figure: cat.jpg → 01001… → ACGAT… → Synthesis → Pool → Sequencing → cat.jpg]
Organizing written data
Efficient retrieval
Coping with errors
The easy way: convert base 2 to base 4
Binary:  10100011 10010001 11100111 11000101 10010100 10111101
Base 4:  2 2 0 3  2 1 0 1  3 2 1 3  3 0 1 1  2 1 1 0  2 3 3 1
DNA:     G G A T  G C A C  T G C T  T A C C  G C C A  G T T C
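A minimal sketch of this direct conversion, assuming the digit-to-nucleotide mapping that the slide's example implies (0 = A, 1 = C, 2 = G, 3 = T). The function names `bytes_to_dna` and `dna_to_bytes` are hypothetical.

```python
NUCLEOTIDES = "ACGT"  # base-4 digit 0 -> A, 1 -> C, 2 -> G, 3 -> T (read off the slide's example)

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_dna; the strand length must be a multiple of four."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | NUCLEOTIDES.index(base)
        out.append(byte)
    return bytes(out)

# The six bytes shown on the slide.
example = bytes([0b10100011, 0b10010001, 0b11100111,
                 0b11000101, 0b10010100, 0b10111101])
print(bytes_to_dna(example))
```

Each byte becomes exactly four nucleotides, so the six example bytes map to a 24-nucleotide strand.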
But this approach isn’t feasible for more than a few bytes
P[Attach] = 99%
G G A T G C A
99% 98% 97% 96.1% 95.1% 94.2% …
100 nts: 36.6%
200 nts: 13.4%
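The percentages follow from compounding the per-nucleotide attach probability. The sketch below assumes each attachment succeeds independently with probability 0.99, which is how the slide's numbers work out; `full_length_yield` is a hypothetical name.

```python
def full_length_yield(length: int, p_attach: float = 0.99) -> float:
    """Probability that all `length` attachments succeed, assuming independence."""
    return p_attach ** length

for n in (1, 2, 3, 4, 5, 6, 100, 200):
    print(f"{n:>3} nts: {full_length_yield(n) * 100:.1f}%")
```

Because the yield decays geometrically with strand length, long strands are rarely synthesized intact, which motivates splitting data across many short strands.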
Break binary data into chunks stored in separate strands
10100011 10010001 | 11100111 11000101 | 10010100 10111101
Each strand carries key identifiers (“primers”) and an address within the value:
Primer     Data             Address  Primer
C A T C C  G G A T G C A C  A A A A  A T G T T
C A T C C  T G C T T A C C  A A A C  A T G T T
C A T C C  G C C A G T T C  A A A G  A T G T T
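A sketch of how the chunking could be expressed in code, using the layout visible in the slide's strands (primer, data, address, primer), the primers CATCC and ATGTT, 8-nucleotide chunks, and 4-nucleotide base-4 addresses. The function names and the choice to raise an error when the address space is exhausted are assumptions for illustration.

```python
NUCLEOTIDES = "ACGT"  # base-4 digit 0 -> A, 1 -> C, 2 -> G, 3 -> T

def to_base4(value: int, width: int) -> str:
    """Write a non-negative integer as `width` nucleotides (base 4, big-endian)."""
    if value >= 4 ** width:
        raise ValueError("address does not fit in the address field")
    digits = []
    for _ in range(width):
        digits.append(NUCLEOTIDES[value % 4])
        value //= 4
    return "".join(reversed(digits))

def encode_strands(payload: str, chunk_len: int = 8, addr_len: int = 4,
                   left: str = "CATCC", right: str = "ATGTT") -> list:
    """Split a nucleotide payload into addressed strands flanked by primers."""
    strands = []
    for index, start in enumerate(range(0, len(payload), chunk_len)):
        chunk = payload[start:start + chunk_len]
        strands.append(left + chunk + to_base4(index, addr_len) + right)
    return strands

for strand in encode_strands("GGATGCACTGCTTACCGCCAGTTC"):
    print(strand)
```

The address lets the chunks be put back in order after sequencing, since strands in a pool have no physical ordering.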
Pool containing stored strands for all keys & values!
Addresses within the value; key identifiers (“primers”)
A T T T G C C T A C G A A A C T T G A C C G
A T T T G C C T A C A A A A C A C G T A G G
A T T T G C C T A C C A A A C C A T T C G T
get(key)
cat.jpg
Address, Primers
Strands with 3 different primers
PCR: selectively amplify strands based on their primer
Sample: almost all strands have desired primer
Reads are destructive, so replenish when necessary
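As a rough software analogy for random access, the sketch below models the pool as a list of strings and replaces PCR with a plain filter on the flanking primers. This is a large simplification of the chemistry. The strand layout (primer, data, address, primer) and the primers CATCC and ATGTT are taken from the earlier chunking example; the second key's strand and all function names are hypothetical.

```python
NUCLEOTIDES = "ACGT"  # base-4 digit 0 -> A, 1 -> C, 2 -> G, 3 -> T

def select_by_primer(pool, left, right):
    """Stand-in for PCR: keep only strands flanked by the requested primers."""
    return [s for s in pool if s.startswith(left) and s.endswith(right)]

def reassemble(strands, left, right, addr_len=4):
    """Strip primers, read each strand's address, and join data in address order."""
    chunks = {}
    for strand in strands:
        body = strand[len(left):len(strand) - len(right)]
        data, addr = body[:-addr_len], body[-addr_len:]
        index = 0
        for base in addr:
            index = index * 4 + NUCLEOTIDES.index(base)
        chunks[index] = data
    return "".join(chunks[i] for i in sorted(chunks))

# An unordered pool holding strands for two keys.
pool = [
    "CATCCGCCAGTTCAAAGATGTT",
    "TTGCAACGTACGTAAAACCGAT",  # hypothetical strand belonging to another key
    "CATCCGGATGCACAAAAATGTT",
    "CATCCTGCTTACCAAACATGTT",
]
selected = select_by_primer(pool, "CATCC", "ATGTT")
payload = reassemble(selected, "CATCC", "ATGTT")
print(payload)
```

Only the strands for the requested key reach the decoder, so the cost of a read does not grow with the total amount of data in the pool.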
Both synthesis and sequencing are error prone:
Original:      G G A T G C A
Insertions:    G G A T A G C A
Deletions:     G G A T G A
Substitutions: G G A T C C A
Error rates ~1% per nucleotide!
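The three error types can be written as small string operations. The positions and bases below are chosen to reproduce the slide's example strands as best they can be read from the figure; the helper names are hypothetical.

```python
def insert(strand: str, pos: int, base: str) -> str:
    """Insertion: an extra nucleotide appears before position `pos`."""
    return strand[:pos] + base + strand[pos:]

def delete(strand: str, pos: int) -> str:
    """Deletion: the nucleotide at position `pos` is lost."""
    return strand[:pos] + strand[pos + 1:]

def substitute(strand: str, pos: int, base: str) -> str:
    """Substitution: the nucleotide at position `pos` is replaced."""
    return strand[:pos] + base + strand[pos + 1:]

original = "GGATGCA"
print(insert(original, 4, "A"))      # one nucleotide longer
print(delete(original, 5))           # one nucleotide shorter
print(substitute(original, 4, "C"))  # same length
```

Insertions and deletions change the strand length and shift every later position, which makes them harder to handle than substitutions.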
Address Primer Data
XOR redundancy provides simple error correction
Reserved address space to indicate redundancy data
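A minimal sketch of the XOR idea, assuming a redundancy payload is formed from a pair of data payloads (the slide does not say how strands are grouped, so the pairing is an assumption, as is the name `xor_bytes`). Any one of the three payloads can then be rebuilt from the other two.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

# Two data payloads (the first two chunks of the earlier example).
chunk_a = bytes([0b10100011, 0b10010001])
chunk_b = bytes([0b11100111, 0b11000101])

# The redundancy payload would be stored in its own strand.
parity = xor_bytes(chunk_a, chunk_b)

# If the strand holding chunk_a is lost, it can be recovered:
recovered_a = xor_bytes(parity, chunk_b)
print(recovered_a == chunk_a)
```

Storing the redundancy strand under a reserved address, as the slide mentions, would let a decoder tell it apart from ordinary data strands.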
catcatgg catcatgc
Throughput MBs/week
Encoded and synthesized 3 files (151 kB)
Selected and PCRed one file for random access (42 kB)
Sequenced and decoded the resulting amplified pool
Recovered every bit despite errors in synthesis and sequencing
Photo: Tara Brown / UW
[Figure: Address, Primer, Data; histogram with axes Number of copies and Frequency]
Some strands are missing entirely
If we ignore redundancy data, we cannot recover the file.
Store
Write Read
Redundancy and density
Efficient retrieval
Wet lab experiments
Also in the paper:
trade-off
strands
[Figure: Transistors on Chip, Reading DNA, Writing DNA; 1970 to 2010, axis from 10^2 to 10^10. Source: Robert Carlson]
Simulation, Scheduling, Spatial addressing, Programming with errors, Error correction, Latency-hiding, Cache locality, Circuit design
Photo: Tara Brown / UW