A DNA-Based Archival Storage System James Bornholt * Randolph Lopez - - PowerPoint PPT Presentation



slide-1
SLIDE 1

A DNA-Based Archival Storage System

James Bornholt* Randolph Lopez* Douglas M. Carmean† Luis Ceze* Georg Seelig* Karin Strauss†

* University of Washington

† Microsoft Research

slide-2
SLIDE 2

Facebook cold storage facility: 1 exabyte (10^9 GB), 66,000 square feet

slide-3
SLIDE 3

DNA molecules as storage

slide-4
SLIDE 4

DNA molecules as storage

Extremely dense. Theory: 1 exabyte in 1 in³

slide-5
SLIDE 5

DNA molecules as storage

Extremely dense. Theory: 1 exabyte in 1 in³

Extremely durable. Half-life > 500 years

slide-6
SLIDE 6

Store

A DNA-based archival storage system

Write Read

slide-7
SLIDE 7

Store

A DNA-based archival storage system

Write Read

Redundancy and density

slide-8
SLIDE 8

Store

A DNA-based archival storage system

Write Read

Redundancy and density Efficient retrieval

slide-9
SLIDE 9

Store

A DNA-based archival storage system

Write Read

Redundancy and density Efficient retrieval Wet lab experiments

slide-10
SLIDE 10

DNA manipulation

slide-11
SLIDE 11
slide-12
SLIDE 12

DNA molecules

Four nucleotides:

A C G T

Adenine Cytosine Guanine Thymine

slide-13
SLIDE 13

DNA molecules

Four nucleotides:

A C G T

Adenine Cytosine Guanine Thymine

A DNA strand (oligonucleotide) is a linear sequence of these nucleotides:

G A C A C C T

slide-14
SLIDE 14

DNA molecules

Four nucleotides:

A C G T

Adenine Cytosine Guanine Thymine

A DNA strand (oligonucleotide) is a linear sequence of these nucleotides:

G A C A C C T

Two strands can bind to each other if they are complementary:

C T G T G G A
G A C A C C T

slide-15
SLIDE 15

DNA molecules

Four nucleotides:

A C G T

Adenine Cytosine Guanine Thymine

A DNA strand (oligonucleotide) is a linear sequence of these nucleotides:

G A C A C C T

Two strands can bind to each other if they are complementary:

C T G T G G A
G A C A C C T

slide-16
SLIDE 16

DNA molecules

Four nucleotides:

A C G T

Adenine Cytosine Guanine Thymine

A DNA strand (oligonucleotide) is a linear sequence of these nucleotides:

G A C A C C T

Two strands can bind to each other if they are complementary:

C T G T G G A
G A C A C C T

C, G are complementary
A, T are complementary

slide-17
SLIDE 17

DNA molecules

Four nucleotides:

A C G T

Adenine Cytosine Guanine Thymine

A DNA strand (oligonucleotide) is a linear sequence of these nucleotides:

G A C A C C T

Two strands can bind to each other if they are complementary:

C T G T G G A
G A C A C C T

C, G are complementary
A, T are complementary

Partial errors allowed

slide-18
SLIDE 18

DNA manipulation

Synthesis: manufacturing DNA strands GACACCT

G A C A C C T

  • Chemical synthesis process appends one nucleotide at a time
  • Maximum practical length ~200 nts
  • Typically produces thousands of copies of the strand
slide-19
SLIDE 19

DNA manipulation

Synthesis: manufacturing DNA strands GACACCT

G A C A C C T

  • Chemical synthesis process appends one nucleotide at a time
  • Maximum practical length ~200 nts
  • Typically produces thousands of copies of the strand

Sequencing: reading DNA strands GACACCT

G A C A C C T

  • Produces many reads of a strand
  • Much higher throughput than synthesis
slide-20
SLIDE 20

An archival storage system

slide-21
SLIDE 21

System overview

Archival storage system structured as a key-value store

slide-22
SLIDE 22

System overview

Archival storage system structured as a key-value store put(key, value)

slide-23
SLIDE 23

System overview

Archival storage system structured as a key-value store put(key, value) get(key)

slide-24
SLIDE 24

System overview

Archival storage system structured as a key-value store put(key, value) get(key)

slide-25
SLIDE 25

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001…

cat.jpg

slide-26
SLIDE 26

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT…

cat.jpg

slide-27
SLIDE 27

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis

cat.jpg

slide-28
SLIDE 28

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis

cat.jpg

slide-29
SLIDE 29

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis Pool

cat.jpg

slide-30
SLIDE 30

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis Pool

cat.jpg cat.jpg

slide-31
SLIDE 31

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis Pool

cat.jpg cat.jpg cat.jpg

slide-32
SLIDE 32

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis Pool

cat.jpg cat.jpg cat.jpg

slide-33
SLIDE 33

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis Pool

cat.jpg cat.jpg cat.jpg

Sequencing

slide-34
SLIDE 34

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis Pool

cat.jpg cat.jpg cat.jpg

Sequencing

slide-35
SLIDE 35

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis Pool

cat.jpg cat.jpg cat.jpg

Sequencing

Organizing written data

slide-36
SLIDE 36

System overview

Archival storage system structured as a key-value store put(key, value) get(key) 01001… ACGAT… Synthesis Pool

cat.jpg cat.jpg cat.jpg

Sequencing

Organizing written data Efficient retrieval

slide-37
SLIDE 37

System overview

Archival storage system structured as a key-value store: put(key, value), get(key)

Write path: cat.jpg → 01001… → ACGAT… → Synthesis → Pool

Read path: Pool → Sequencing → cat.jpg

Organizing written data
Efficient retrieval
Coping with errors
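
As a rough mental model of the put/get interface, the following Python sketch replaces the wet lab with an in-memory list. Everything in it is an assumption made for illustration: the class name `ToyDnaStore`, the 2-bits-per-nucleotide mapping, the counter-derived primers, and storing each value as a single strand. It is not the system's actual encoding.

```python
from typing import Dict, List

BASES = "ACGT"   # assumed mapping: 0->A, 1->C, 2->G, 3->T
PRIMER_LEN = 6   # hypothetical primer length, chosen only for this sketch


def encode(data: bytes) -> str:
    """Map each pair of bits to one nucleotide."""
    return "".join(BASES[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))


def decode(strand: str) -> bytes:
    """Inverse of encode(): four nucleotides back to one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for nt in strand[i:i + 4]:
            byte = byte * 4 + BASES.index(nt)
        out.append(byte)
    return bytes(out)


class ToyDnaStore:
    """In-memory stand-in for the pool; no chemistry is modeled."""

    def __init__(self) -> None:
        self.primers: Dict[str, str] = {}
        self.pool: List[str] = []

    def put(self, key: str, value: bytes) -> None:
        if key in self.primers:
            raise KeyError(f"{key!r} already stored")
        index = len(self.primers)
        if index >= 4 ** PRIMER_LEN:
            raise OverflowError("no unused primers left")
        # Derive a unique fixed-length primer from a running counter.
        primer = encode(index.to_bytes(2, "big"))[-PRIMER_LEN:]
        self.primers[key] = primer
        self.pool.append(primer + encode(value))  # stands in for synthesis

    def get(self, key: str) -> bytes:
        primer = self.primers[key]
        # Stands in for selection and sequencing: keep this key's strands.
        matches = [s for s in self.pool if s.startswith(primer)]
        return decode(matches[0][PRIMER_LEN:])


store = ToyDnaStore()
store.put("cat.jpg", b"\xa3\x91")
print(store.get("cat.jpg") == b"\xa3\x91")  # True
```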

slide-38
SLIDE 38

Writing data to DNA

The easy way: convert base 2 to base 4

10100011 10010001 11100111 11000101 10010100 10111101

slide-39
SLIDE 39

Writing data to DNA

The easy way: convert base 2 to base 4

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331

slide-40
SLIDE 40

Writing data to DNA

The easy way: convert base 2 to base 4

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC
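
The base-2 to base-4 conversion can be sketched as follows. The digit-to-nucleotide mapping (0→A, 1→C, 2→G, 3→T) is inferred from the example on this slide, and the function names are hypothetical.

```python
BASES = "ACGT"  # assumed mapping inferred from the slide: 0->A, 1->C, 2->G, 3->T


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, two bits per nucleotide."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # most significant bit pair first
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_dna()."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for nt in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(nt)
        out.append(byte)
    return bytes(out)


data = bytes([0b10100011, 0b10010001, 0b11100111,
              0b11000101, 0b10010100, 0b10111101])
print(bytes_to_dna(data))  # GGATGCACTGCTTACCGCCAGTTC
```

Running it on the six bytes from the slide gives the same 24-nucleotide sequence shown above.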

slide-41
SLIDE 41

Writing data to DNA

The easy way: convert base 2 to base 4

But this approach isn’t feasible for more than a few bytes

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-42
SLIDE 42

Writing data to DNA

The easy way: convert base 2 to base 4

G

But this approach isn’t feasible for more than a few bytes

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-43
SLIDE 43

Writing data to DNA

The easy way: convert base 2 to base 4

G G

But this approach isn’t feasible for more than a few bytes

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-44
SLIDE 44

Writing data to DNA

The easy way: convert base 2 to base 4

G G

But this approach isn’t feasible for more than a few bytes

P[Attach] = 99%

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-45
SLIDE 45

Writing data to DNA

The easy way: convert base 2 to base 4

G G

But this approach isn’t feasible for more than a few bytes P[Attach] = 99%

99%

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-46
SLIDE 46

Writing data to DNA

The easy way: convert base 2 to base 4

G G A

But this approach isn’t feasible for more than a few bytes P[Attach] = 99%

99% 98%

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-47
SLIDE 47

Writing data to DNA

The easy way: convert base 2 to base 4

G G A T G C A

But this approach isn’t feasible for more than a few bytes P[Attach] = 99%

99% 98% 97% 96.1% 95.1% 94.2%

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-48
SLIDE 48

Writing data to DNA

The easy way: convert base 2 to base 4

G G A T G C A

But this approach isn’t feasible for more than a few bytes P[Attach] = 99%

99% 98% 97% 96.1% 95.1% 94.2% … 100 nts: 36.6%

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-49
SLIDE 49

Writing data to DNA

The easy way: convert base 2 to base 4

G G A T G C A

But this approach isn’t feasible for more than a few bytes P[Attach] = 99%

99% 98% 97% 96.1% 95.1% 94.2% … 100 nts: 36.6% … 200 nts: 13.4%

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC
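
The shrinking percentages follow from compounding the per-step attach probability: if each nucleotide attaches with probability 0.99 and steps are assumed independent, the chance that a strand of n nucleotides is fully assembled is 0.99^n. A minimal sketch (the function name is made up here):

```python
def full_length_yield(n_nucleotides: int, p_attach: float = 0.99) -> float:
    """Probability that every one of n independent attach steps succeeds."""
    return p_attach ** n_nucleotides


for n in (1, 2, 3, 100, 200):
    print(f"{n:>3} nts: {full_length_yield(n):.1%}")
```

At 100 and 200 nucleotides this gives roughly 36.6% and 13.4%, matching the slide, which is why long single strands are impractical and the data has to be split across many short strands.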

slide-50
SLIDE 50

Chunking data

Break binary data into chunks stored in separate strands

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-51
SLIDE 51

Chunking data

Break binary data into chunks stored in separate strands

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331
GGAT GCAC TGCT TACC GCCA GTTC

slide-52
SLIDE 52

Chunking data

Break binary data into chunks stored in separate strands

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331

GGATGCAC AAAA
TGCTTACC AAAC
GCCAGTTC AAAG

Addresses within the value

slide-53
SLIDE 53

Chunking data

Break binary data into chunks stored in separate strands

10100011 10010001 11100111 11000101 10010100 10111101
2203 2101 3213 3011 2110 2331

CATCC GGATGCAC AAAA ATGTT
CATCC TGCTTACC AAAC ATGTT
CATCC GCCAGTTC AAAG ATGTT

Addresses within the value
Key identifiers (“primers”)
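
A sketch of the chunking step, under stated assumptions: the strand layout (primer, payload, address, primer) is read off the example strands in the slides, addresses are written as fixed-width base-4 numbers with 0→A, 1→C, 2→G, 3→T, and the function names are hypothetical.

```python
from typing import List

BASES = "ACGT"  # assumed digit-to-nucleotide mapping


def to_base4(n: int, width: int) -> str:
    """Write n as a fixed-width base-4 number using nucleotides as digits."""
    digits = []
    for _ in range(width):
        digits.append(BASES[n % 4])
        n //= 4
    if n:
        raise ValueError("address does not fit in the given width")
    return "".join(reversed(digits))


def make_strands(payload: str, chunk_len: int, primer_start: str,
                 primer_end: str, addr_width: int = 4) -> List[str]:
    """Split a nucleotide payload into addressed, primer-tagged strands."""
    strands = []
    for i in range(0, len(payload), chunk_len):
        chunk = payload[i:i + chunk_len]
        address = to_base4(i // chunk_len, addr_width)
        strands.append(primer_start + chunk + address + primer_end)
    return strands


for s in make_strands("GGATGCACTGCTTACCGCCAGTTC", 8, "CATCC", "ATGTT"):
    print(s)
```

With the 24-nucleotide payload from the earlier slides, 8-nucleotide chunks, and the primers CATCC and ATGTT, this produces three strands with addresses AAAA, AAAC, and AAAG.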

slide-54
SLIDE 54

Efficient reads

CATCC GGATGCAC AAAA ATGTT
CATCC TGCTTACC AAAC ATGTT
CATCC GCCAGTTC AAAG ATGTT

Addresses within the value
Key identifiers (“primers”)

slide-55
SLIDE 55

A T T T G C C T A C G A A A C T T G A C C G

Efficient reads

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Addresses within the value Key identifiers (“primers”)

slide-56
SLIDE 56

A T T T G C C T A C G A A A C T T G A C C G

Efficient reads

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Pool containing stored strands for all keys & values! Addresses within the value Key identifiers (“primers”)

slide-57
SLIDE 57

A T T T G C C T A C G A A A C T T G A C C G

Efficient reads

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Pool containing stored strands for all keys & values!

get(key)

cat.jpg

Addresses within the value Key identifiers (“primers”)

slide-58
SLIDE 58

A T T T G C C T A C G A A A C T T G A C C G

Random access

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Address Primers

slide-59
SLIDE 59

A T T T G C C T A C G A A A C T T G A C C G

Random access

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Address Primers Strands with 3 different primers

slide-60
SLIDE 60

A T T T G C C T A C G A A A C T T G A C C G

Random access

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Address Primers

PCR

Selectively amplify strands based on their primer Strands with 3 different primers

slide-61
SLIDE 61

A T T T G C C T A C G A A A C T T G A C C G

Random access

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Address Primers

PCR

Selectively amplify strands based on their primer Strands with 3 different primers

slide-62
SLIDE 62

A T T T G C C T A C G A A A C T T G A C C G

Random access

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Address Primers

PCR

Selectively amplify strands based on their primer

Sample

Strands with 3 different primers Almost all strands have desired primer

slide-63
SLIDE 63

A T T T G C C T A C G A A A C T T G A C C G

Random access

A T T T G C C T A C A A A A C A C G T A G G A T T T G C C T A C C A A A C C A T T C G T

Address Primers

PCR

Selectively amplify strands based on their primer

Sample

Before PCR: strands with 3 different primers
After PCR: almost all strands have desired primer

Reads are destructive, so replenish when necessary
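
The effect of selective amplification can be illustrated with an idealized model in which every PCR cycle doubles the strands that start with the chosen primer and leaves the others untouched. Real PCR is less efficient and uses primers at both ends; the primers and payloads below are invented for the example, and sampling (the destructive read) is not modeled.

```python
from collections import Counter


def pcr(pool: Counter, primer: str, cycles: int) -> Counter:
    """Idealized PCR: double matching strands once per cycle."""
    return Counter({
        strand: copies * (2 ** cycles if strand.startswith(primer) else 1)
        for strand, copies in pool.items()
    })


def fraction_with_primer(pool: Counter, primer: str) -> float:
    """Share of molecules in the pool that carry the given primer."""
    total = sum(pool.values())
    hits = sum(n for s, n in pool.items() if s.startswith(primer))
    return hits / total


# Hypothetical pool: three keys, 1000 copies of one strand each.
pool = Counter({"ATTTGGGATGCAC": 1000,
                "CCGTATGCTTACC": 1000,
                "GGACTGCCAGTTC": 1000})

print(fraction_with_primer(pool, "ATTTG"))               # one third
print(fraction_with_primer(pcr(pool, "ATTTG", 20), "ATTTG"))  # close to 1
```

After amplification, a random sample drawn from the pool consists almost entirely of strands for the requested key, so only that sample needs to be sequenced.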

slide-64
SLIDE 64

Error correction

Both synthesis and sequencing are error prone:

Original:      G G A T G C A
Insertion:     G G A T A G C A
Deletion:      G G A T G A
Substitution:  G G A T C C A

Error rates ~1% per nucleotide!
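
To get a feel for what a ~1% per-nucleotide error rate means, the following sketch injects the three error types at random. It is a toy model: the three error types are assumed to be equally likely and independent per position, which the slides do not state, and the function name `corrupt` is made up.

```python
import random

BASES = "ACGT"


def corrupt(strand: str, rate: float, rng: random.Random) -> str:
    """Apply an insertion, deletion, or substitution at each position
    with probability `rate` (error types assumed equally likely)."""
    out = []
    for nt in strand:
        if rng.random() >= rate:
            out.append(nt)  # copied correctly
            continue
        kind = rng.choice(("insertion", "deletion", "substitution"))
        if kind == "insertion":
            out.append(rng.choice(BASES))  # extra base before the original
            out.append(nt)
        elif kind == "substitution":
            out.append(rng.choice([b for b in BASES if b != nt]))
        # deletion: append nothing
    return "".join(out)


rng = random.Random(0)
original = "GGATGCA" * 10  # a 70-nucleotide strand
reads = [corrupt(original, 0.01, rng) for _ in range(1000)]
clean = sum(read == original for read in reads)
print(f"{clean / len(reads):.0%} of reads came back unchanged")
```

Even at 1% per nucleotide, a sizable share of reads of a strand this long contain at least one error, which is why the system needs redundancy.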

slide-65
SLIDE 65

Logical redundancy

slide-66
SLIDE 66

Logical redundancy

Address Primer Data

slide-67
SLIDE 67

Logical redundancy

Address Primer Data

slide-68
SLIDE 68

Logical redundancy

Address Primer Data

slide-69
SLIDE 69

Logical redundancy

Address Primer Data

XOR redundancy provides simple error correction

slide-70
SLIDE 70

Logical redundancy

Address Primer Data

XOR redundancy provides simple error correction

Reserved address space to indicate redundancy data
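
The idea behind XOR redundancy can be shown on raw bytes: store two payloads plus their XOR, and any one of the three can be rebuilt from the other two. How payloads are paired and how the redundant strand is addressed are not spelled out on the slides, so the pairing below is an assumption, and the sketch works on bytes rather than nucleotides.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


payload_a = bytes([0b10100011, 0b10010001])
payload_b = bytes([0b11100111, 0b11000101])
redundant = xor_bytes(payload_a, payload_b)  # stored as a third strand

# Suppose the strand holding payload_a is lost or unreadable:
recovered_a = xor_bytes(redundant, payload_b)
print(recovered_a == payload_a)  # True
```

This works because XOR is its own inverse: (A xor B) xor B equals A.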

slide-71
SLIDE 71

Wet lab results

slide-72
SLIDE 72

The process

slide-73
SLIDE 73

The process

slide-74
SLIDE 74

The process

slide-75
SLIDE 75

The process

catcatgg

slide-76
SLIDE 76

The process

catcatgg

slide-77
SLIDE 77

The process

catcatgg

slide-78
SLIDE 78

The process

catcatgg

slide-79
SLIDE 79

The process

catcatgg

slide-80
SLIDE 80

The process

catcatgg catcatgc

slide-81
SLIDE 81

The process

catcatgg catcatgc

slide-82
SLIDE 82

The process

catcatgg catcatgc

Throughput: MBs/week

slide-83
SLIDE 83

Decoding

Encoded and synthesized 3 files (151 kB):

slide-84
SLIDE 84

Photo: Tara Brown / UW

slide-85
SLIDE 85

Decoding

Encoded and synthesized 3 files (151 kB):

slide-86
SLIDE 86

Decoding

Encoded and synthesized 3 files (151 kB):

Selected and PCRed one file for random access (42 kB):

slide-87
SLIDE 87

Decoding

Encoded and synthesized 3 files (151 kB):

Selected and PCRed one file for random access (42 kB):

Sequenced and decoded the resulting amplified pool:

slide-88
SLIDE 88

Decoding

Encoded and synthesized 3 files (151 kB):

Selected and PCRed one file for random access (42 kB):

Sequenced and decoded the resulting amplified pool:

Recovered every bit despite errors in synthesis and sequencing

slide-89
SLIDE 89

The importance of redundancy

Address Primer Data

slide-90
SLIDE 90

The importance of redundancy

Address Primer Data

slide-91
SLIDE 91

The importance of redundancy

Address Primer Data

[Figure: histogram of copies per strand; axes: number of copies, frequency]

If we ignore redundancy data, we cannot recover the file.

slide-92
SLIDE 92

The importance of redundancy

Address Primer Data

[Figure: histogram of copies per strand; axes: number of copies, frequency]

Some strands are missing entirely

If we ignore redundancy data, we cannot recover the file.

slide-93
SLIDE 93

Store

A DNA-based archival storage system

Write Read

Redundancy and density Efficient retrieval Wet lab experiments

slide-94
SLIDE 94

Store

A DNA-based archival storage system

Write Read

Redundancy and density Efficient retrieval Wet lab experiments

Also in the paper:

  • Reliability-density trade-off
  • Simulation of decay over time
  • Error analysis
  • Model of truncated strands

slide-95
SLIDE 95
slide-96
SLIDE 96

MBs/week vs. GBs/second

slide-97
SLIDE 97

DNA productivity is growing

[Figure: productivity by year, 1970 to 2010, on a log scale from 10^2 to 10^10; series: transistors on chip, reading DNA, writing DNA]

Source: Robert Carlson

slide-98
SLIDE 98

DNA technology is miniaturizing

slide-99
SLIDE 99

We’ve just barely scratched the surface

[Figure: accuracy versus reads used; tick labels 0% to 100% and 0.01% to 10%]

slide-100
SLIDE 100

Our community has seen these challenges before

  • Simulation
  • Scheduling
  • Spatial addressing
  • Programming with errors
  • Error correction
  • Latency-hiding optimizations
  • Cache locality
  • Circuit design

slide-101
SLIDE 101

Photo: Tara Brown / UW