Robust Data Storage in DNA with Error-Correcting Codes (Robert Grass and Reinhard Heckel)

SLIDE 1

Robust Data Storage in DNA with Error-Correcting Codes

Robert Grass⋆ and Reinhard Heckel∗⋆
ETH Zurich⋆, IBM Research Zurich∗
September 28, 2015

Research performed at ETH Zurich
Thanks to Prof. W. Stark, D. Paunescu, M. Puddu

SLIDE 2

DNA

◮ DNA is a molecule storing genetic information of organisms
◮ We view DNA as a string of four different nucleotides

A C G T . . . G A C T. . .

SLIDE 3

Storing information in DNA

Binary data can be encoded as a DNA string:

01011010 → encode → ACGTACGT

◮ First message written on DNA in the 1990s
◮ Church et al. (Science 2012) and Goldman et al. (Nature 2013) stored about 1Mb on DNA

However, previous approaches are not robust!
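The idea of writing bits as nucleotides can be sketched in a few lines. This is a minimal illustration only: the two-bits-per-nucleotide table below is an assumption chosen for simplicity, not the mapping used in the talk, and the slide's example string (01011010 to ACGTACGT) is schematic and does not follow it.

```python
# Assumed mapping for illustration (not from the slides): two bits per nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_dna(bits: str) -> str:
    """Translate a bit string of even length into a nucleotide string."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Invert bits_to_dna."""
    return "".join(BASE_TO_BITS[base] for base in dna)


if __name__ == "__main__":
    dna = bits_to_dna("01011010")
    print(dna)               # CCGG under the assumed table
    print(dna_to_bits(dna))  # 01011010
```

A plain mapping like this has no redundancy, so a single wrong base corrupts the data, which is the lack of robustness the slide points out.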

SLIDE 4

Our contribution

◮ Making DNA data storage robust and cheaper by using error-correcting codes and storing DNA in synthetic fossils
◮ Information can be recovered from DNA stored at the Global Seed Vault (−18 °C) after over 1 million years

SLIDE 5

Motivation: Maximum storage times

[Figure: maximum storage times on a logarithmic axis from 1 to 100,000 years. Labeled examples: Archimedes, “The Method”; DNA in ancient bone.]

SLIDE 6

Motivation: Information density

[Figure: information density on a logarithmic axis from 0.01 to 1000 Gbit/mm³, with “Our work” marked.]

SLIDE 7

DNA is not a disc: The DNA channel

◮ Can only read and write short DNA segments
◮ Segments cannot be spatially ordered

01011010 → encode → ACATACGT CATGTACA GCTATGCC → synthesize → sequence → GCTATGCC CATGTACA ACATACGT → decode → 01011010

Error sources:

◮ Individual base errors: ‘CTACA...’ instead of ‘ATACG...’
◮ Loss of complete sequences
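The channel described on this slide can be mimicked with a toy simulation. The sketch below is an assumption-laden model, not the authors' measurement setup: the function name `dna_channel` and the probabilities `p_sub` and `p_loss` are hypothetical, substitutions are drawn uniformly, and insertions and deletions are ignored.

```python
import random

BASES = "ACGT"


def dna_channel(segments, p_sub=0.01, p_loss=0.05, rng=None):
    """Toy DNA channel: drop whole segments, substitute bases, forget the order.

    p_sub and p_loss are illustrative parameters, not values from the talk.
    """
    rng = rng or random.Random()
    received = []
    for segment in segments:
        if rng.random() < p_loss:
            continue  # loss of a complete sequence
        noisy = []
        for base in segment:
            if rng.random() < p_sub:
                # individual base error: replace with a different nucleotide
                noisy.append(rng.choice([b for b in BASES if b != base]))
            else:
                noisy.append(base)
        received.append("".join(noisy))
    rng.shuffle(received)  # segments come back in no particular order
    return received


if __name__ == "__main__":
    sent = ["ACATACGT", "CATGTACA", "GCTATGCC"]
    print(dna_channel(sent, p_sub=0.1, p_loss=0.2, rng=random.Random(1)))
```

Any decoder has to cope with all three effects at once: missing segments, wrong bases, and unknown order.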

SLIDE 8

Encoding and decoding scheme

Encoding: encode outer code → add unique index to each symbol → encode each column → DNA synthesis

Decoding: DNA sequencing → decode inner code → sort and remove indices → decode outer code
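The role of the index in this scheme can be sketched as follows. This is a simplified illustration: here the index is kept as a separate tuple field, whereas in a real system it would be encoded into the sequence itself, and the helper names `add_indices` and `restore_order` are hypothetical.

```python
import random


def add_indices(chunks):
    """Attach a unique index to every chunk before it is written."""
    return [(i, chunk) for i, chunk in enumerate(chunks)]


def restore_order(received, total):
    """Sort received chunks by index and strip the indices.

    Positions whose chunk never arrived are returned as None, so a later
    decoding stage can treat them as erasures.
    """
    by_index = {}
    for index, chunk in received:
        by_index.setdefault(index, chunk)  # keep the first copy of duplicates
    return [by_index.get(i) for i in range(total)]


if __name__ == "__main__":
    indexed = add_indices(["ACAT", "CATG", "GCTA", "TTGC"])
    random.shuffle(indexed)  # the channel forgets the order
    indexed = [item for item in indexed if item[0] != 2]  # one chunk is lost
    print(restore_order(indexed, total=4))  # ['ACAT', 'CATG', None, 'TTGC']
```

Because the position of a missing chunk is known after sorting, a lost sequence shows up as an erasure rather than as an error at an unknown location.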

SLIDE 9

Encoding and decoding scheme

Inner code:

◮ Reed-Solomon code over GF(47) with n = 39, k = 33
◮ Corrects individual base errors

Outer code:

◮ Reed-Solomon code over extension field GF(47^30) with N = 713, K = 594
◮ Recovers lost sequences (erasures)
◮ Corrects errors from the inner decoder
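The code parameters above translate into correction capabilities through the standard Reed-Solomon bound: a code with n − k redundant symbols can be decoded whenever 2·(errors) + (erasures) ≤ n − k. The sketch below only does this arithmetic; the helper names are hypothetical, and the slides do not state which decoding radii were actually used.

```python
from fractions import Fraction


def rs_summary(n, k):
    """Basic properties of a Reed-Solomon code with length n and dimension k."""
    redundancy = n - k
    return {
        "redundancy": redundancy,
        "max_errors": redundancy // 2,  # errors only, no erasures
        "max_erasures": redundancy,     # erasures only, no errors
        "rate": Fraction(k, n),
    }


def can_decode(n, k, errors, erasures):
    """Standard guarantee: decodable if 2 * errors + erasures <= n - k."""
    return 2 * errors + erasures <= n - k


if __name__ == "__main__":
    inner = rs_summary(39, 33)    # corrects up to 3 symbol errors per sequence
    outer = rs_summary(713, 594)  # 119 redundant sequences
    print(inner)
    print(outer)
    print(float(inner["rate"] * outer["rate"]))  # overall rate of both codes
```

For example, the outer code could tolerate 100 lost sequences together with 9 sequences that the inner decoder got wrong, since 2·9 + 100 = 118 ≤ 119.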

Why GF(47)?

◮ Allows us to avoid runs of more than 3 identical nucleotides, as in ‘CTAGGGG’, which significantly increase reading errors

Information-theoretically close to optimal
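One way a field of size 47 can enforce the run-length limit is sketched below. The slides do not spell out the mapping, so this is an assumption: each GF(47) symbol is written as three nucleotides whose last two differ. There are 4 · 4 · 3 = 48 such triplets, and 47 is the largest prime not exceeding 48. Which 47 of the 48 triplets are used here is an arbitrary choice for illustration.

```python
from itertools import product

BASES = "ACGT"

# Assumed construction: triplets whose second and third nucleotide differ.
CANDIDATES = ["".join(t) for t in product(BASES, repeat=3) if t[1] != t[2]]
TRIPLETS = CANDIDATES[:47]  # arbitrary choice of 47 out of the 48 candidates
SYMBOL_OF = {triplet: symbol for symbol, triplet in enumerate(TRIPLETS)}


def encode_symbols(symbols):
    """Write a list of GF(47) symbols (integers 0..46) as nucleotides."""
    return "".join(TRIPLETS[s] for s in symbols)


def decode_symbols(dna):
    """Invert encode_symbols."""
    return [SYMBOL_OF[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def longest_run(dna):
    """Length of the longest run of identical nucleotides."""
    best = run = 0
    previous = None
    for base in dna:
        run = run + 1 if base == previous else 1
        best = max(best, run)
        previous = base
    return best


if __name__ == "__main__":
    print(len(CANDIDATES))  # 48
    # Any 4 consecutive bases span at most two triplets, so checking all
    # pairs is enough to bound the run length of every encoded string.
    print(max(longest_run(a + b) for a in TRIPLETS for b in TRIPLETS))  # 3
```

Under this assumption a run can contain at most the last base of one triplet and the first two bases of the next, so no run exceeds length 3.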

SLIDE 10

Protecting DNA from decay and environment

Dry storage of DNA: in amber, in bone, in silica

SLIDE 11

Protection through DNA encapsulation in silica

Paunescu, Fuhrer, Grass, Angew. Chem. Int. Ed. 2013. Paunescu, Grass et al. Nat. Protoc. 2013

SLIDE 12

Accelerated aging experiment

Archimedes: “The Method”

“Give me a place to stand and with a lever I will move the whole world...”

→ encode → ACATACGT CATGTACA GCTATGCC → synthesis → encapsulation → storage at 70 °C → release → sequencing → ACATACGT CATGTACA GCTATGCC → decode →

“Give me a place to stand and with a lever I will move the whole world...”

SLIDE 13

Errors in and loss of whole sequences

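[Figure: bar chart of errors and erasures at each decoding stage for three samples: original DNA, 1/2 week at 70 °C, 1 week at 70 °C. Legend text as extracted: “initial error”, “error after inner decoding”, “error outer code erasure”, “final error”. Bar labels as extracted: 59, 68, 71, 2.8, 5.5, 8.9, 0.4, 4.5, 3, 0.3, 2.5, 0.5.]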

◮ In all cases the information could be reconstructed perfectly
◮ 1 week at 70 °C corresponds to 2000 years in Zurich, or 2 million years at the Global Seed Vault (−18.8 °C)
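Equivalences like the one above are commonly obtained by extrapolating decay rates with temperature. The sketch below assumes an Arrhenius model, which the slides do not name, and an assumed Zurich mean temperature of 9.4 °C. It back-solves an activation energy from "1 week at 70 °C corresponds to 2000 years in Zurich" and then extrapolates to −18.8 °C. The result is sensitive to both assumptions and should be read as an order-of-magnitude check, not as the authors' calculation.

```python
import math

R = 8.314462618  # gas constant in J/(mol K)


def kelvin(celsius):
    return celsius + 273.15


def activation_energy(acceleration, t_fast_c, t_slow_c):
    """Activation energy (J/mol) implied by an acceleration factor (Arrhenius)."""
    return R * math.log(acceleration) / (1 / kelvin(t_slow_c) - 1 / kelvin(t_fast_c))


def acceleration_factor(ea, t_fast_c, t_slow_c):
    """How much faster decay runs at t_fast_c than at t_slow_c (Arrhenius)."""
    return math.exp(ea / R * (1 / kelvin(t_slow_c) - 1 / kelvin(t_fast_c)))


if __name__ == "__main__":
    WEEKS_PER_YEAR = 365.25 / 7
    ZURICH_C = 9.4  # assumed mean temperature, not stated on the slide

    # From the slide: 1 week at 70 C corresponds to 2000 years in Zurich.
    ea = activation_energy(2000 * WEEKS_PER_YEAR, 70.0, ZURICH_C)
    print(f"implied activation energy: {ea / 1000:.0f} kJ/mol")

    # Extrapolate the same week at 70 C to the Global Seed Vault temperature.
    years = acceleration_factor(ea, 70.0, -18.8) / WEEKS_PER_YEAR
    print(f"equivalent storage time at -18.8 C: {years:.2e} years")
```

Under these assumptions the extrapolation lands in the range of a few million years, the same order of magnitude as the figure on the slide.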

SLIDE 14

Errors in individual sequences

[Figure: error probability in % (axis ticks at 0.5 and 1) for each substitution type (A→C, A→G, A→T, C→A, C→G, C→T, G→A, G→C, G→T, T→A, T→C, T→G), shown for original DNA, 1/2 week at 70 °C, and 1 week at 70 °C.]

SLIDE 15

Conclusion

◮ Digital information can be stored robustly for thousands of years in DNA
◮ Only the combination of error-correction and DNA encapsulation in silica enables long-term storage

Thank you!
