Error Correcting Codes for DNA based Data Storage - Shubham Chandak




SLIDE 1

Error Correcting Codes for DNA based Data Storage

Shubham Chandak Stanford University ISMB/ECCB 2019

SLIDE 2

Outline

  • Motivation
  • DNA storage setup
  • Illumina sequencing-based DNA storage
  • Nanopore sequencing-based DNA storage
  • Conclusions
SLIDE 3

Motivation

SLIDE 4

The amount of stored data is growing exponentially:

Source: https://www.seagate.com/our-story/data-age-2025/

SLIDE 8

200 Petabyte

  • HDD: 40,000 x 5 TByte HDDs, 40 tons, 10s of years
  • DNA: 1 gram, 1,000s of years, easy duplication

SLIDE 9

https://catalogdna.com/uncategorized/hot-news-for-the-summer-from-catalog/

SLIDE 10

How to store data in DNA sequences?

SLIDE 11

How to store data in DNA sequences?

  • Ability to synthesize short ssDNA oligonucleotides (~150 nt) at scale.

http://www.customarrayinc.com/

SLIDE 12

How to store data in DNA sequences?

  • Convert binary data to A/C/G/T alphabet: e.g., 00 – A, 01 – C, etc.

Binary file: 001010101010 100001010010 100100010010 100100100010 100101001010 010101001001 010101010000

Segment:
0010101010101000010100101001
0001001010010010001010010100
1010010101001001010101010000

Convert to DNA: AGGGGGGACCAGGC ...
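The mapping step can be sketched in a few lines of Python. This is an illustrative sketch only: the slide gives 00 – A and 01 – C, the example output implies 10 – G, and 11 – T is assumed by elimination. The function names are hypothetical.

```python
# Two bits per base. 00->A and 01->C are from the slide; 10->G follows from
# the slide's example; 11->T is an assumption (the only base left).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_dna(bits: str) -> str:
    """Map a binary string of even length to an A/C/G/T string."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Inverse mapping, from bases back to bits."""
    return "".join(BASE_TO_BITS[base] for base in dna)


# First segment from the slide.
print(bits_to_dna("0010101010101000010100101001"))
```

Applied to the first segment on the slide, this reproduces the sequence shown there.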

SLIDE 13

How to store data in DNA sequences?

  • But order of sequences lost in the solution – need to add index to each segment.

Segments with index prepended:
000010101010101000010100101001
010001001010010010001010010100
101010010101001001010101010000

Length of index in binary segment: at least log2(number of segments).
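A minimal sketch of the indexing step, assuming a fixed-width binary index is prepended to each segment. The helper names `add_index` and `restore_order` are hypothetical. The index width is the smallest number of bits that can distinguish all segments.

```python
def add_index(segments):
    """Prepend a fixed-width binary index to every segment."""
    # ceil(log2(n)) bits are enough to number n segments; use at least 1 bit.
    width = max(1, (len(segments) - 1).bit_length())
    return [format(i, f"0{width}b") + seg for i, seg in enumerate(segments)]


def restore_order(indexed_segments, width):
    """Sort shuffled segments by their index and strip the index again."""
    ordered = sorted(indexed_segments, key=lambda s: int(s[:width], 2))
    return [s[width:] for s in ordered]


segments = [
    "0010101010101000010100101001",
    "0001001010010010001010010100",
    "1010010101001001010101010000",
]
indexed = add_index(segments)
shuffled = [indexed[2], indexed[0], indexed[1]]  # order is lost in solution
print(restore_order(shuffled, width=2) == segments)
```

With the three segments from the slide, the index is 2 bits wide and the indexed segments match the ones shown above.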

SLIDE 14

How to store data in DNA sequences?

  • Some sequences have zero coverage while sequencing – erasure coding+coverage.

Also used in traditional storage systems (e.g., RAID).

Figure source: https://www.usenix.org/system/files/login/articles/10_plank-online.pdf
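To illustrate the erasure-coding idea, here is a sketch that uses a single XOR parity segment, in the spirit of RAID-style parity. This is a simplified stand-in rather than the code used in the talk. It can only recover one missing segment, and the function names are hypothetical.

```python
def xor_parity(segments):
    """Bitwise XOR of equal-length binary strings."""
    parity = [0] * len(segments[0])
    for seg in segments:
        parity = [p ^ int(bit) for p, bit in zip(parity, seg)]
    return "".join(str(p) for p in parity)


def recover_missing(received, parity):
    """Recover the single segment marked as None in `received`."""
    known = [seg for seg in received if seg is not None]
    # XOR of all surviving segments and the parity equals the lost segment.
    return xor_parity(known + [parity])


segments = [
    "0010101010101000010100101001",
    "0001001010010010001010010100",
    "1010010101001001010101010000",
]
parity = xor_parity(segments)
# Pretend the second segment had zero coverage during sequencing.
print(recover_missing([segments[0], None, segments[2]], parity) == segments[1])
```

Practical systems use stronger erasure codes that tolerate many lost sequences, combined with enough sequencing coverage to keep the loss rate manageable.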

SLIDE 15

How to store data in DNA sequences?

  • Sequencing and synthesis cause errors – substitutions, insertions and deletions – error correction coding+coverage.

Data bits: 0100010011
Encode (data+parity bits): 01000100111011
Bitflip: 01000101111011
Decoding: 01000100111011
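The slide does not say which code turns the 10 data bits into 14 coded bits. As a stand-in, the sketch below uses the textbook Hamming(7,4) code, which corrects any single bit flip in a 7-bit codeword. The function names are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # 1-indexed codeword positions holding data
PARITY_POSITIONS = (1, 2, 4)     # positions that are powers of two


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    code = [0] * 8               # index 0 is unused
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos] = bit
    syndrome = 0
    for pos in DATA_POSITIONS:
        if code[pos]:
            syndrome ^= pos
    # Choose parity bits so that the XOR of all set positions is zero.
    for pos in PARITY_POSITIONS:
        code[pos] = 1 if syndrome & pos else 0
    return code[1:]


def hamming74_decode(received):
    """Correct up to one flipped bit and return the 4 data bits."""
    code = [0] + list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:                 # a non-zero syndrome names the flipped position
        code[syndrome] ^= 1
    return [code[pos] for pos in DATA_POSITIONS]


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = codeword[:]
corrupted[4] ^= 1                # single bit flip
print(hamming74_decode(corrupted) == [1, 0, 1, 1])
```

Note that this handles substitutions only. Insertions and deletions shift the positions of later symbols and need additional techniques.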

SLIDE 16

How to store data in DNA sequences?

  • Error correction studied extensively for communication and traditional data storage systems – information theory and coding theory.

SLIDE 17

How to store data in DNA sequences?

Error/Erasure Correcting Codes enable reliable data recovery even for noisy, low cost synthesis and sequencing – likely to be the future of DNA storage.

SLIDE 18

DNA storage setup

SLIDE 24

Typical DNA Storage System

File → Segmentation → Outer code → Inner code → Synthesis → Storage → Sequencing + Basecalling → Sequenced reads → Decoding → Reconstructed file

Effects seen in the sequenced reads:

  • Duplication
  • Permutation
  • Loss
  • Corruption
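The four effects listed on this slide can be mimicked with a toy read simulator. This is a rough sketch under simplifying assumptions: reads are drawn uniformly at random with replacement, and corruption is modelled as substitutions only (no insertions or deletions). The function name and parameters are hypothetical.

```python
import random


def simulate_reads(oligos, n_reads, sub_rate, seed=0):
    """Draw noisy reads from a pool of synthesized oligos."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Sampling with replacement gives duplication (an oligo drawn twice),
        # loss (an oligo never drawn) and permutation (random read order).
        oligo = rng.choice(oligos)
        # Corruption: each base is substituted with probability sub_rate.
        read = "".join(
            rng.choice([b for b in "ACGT" if b != base])
            if rng.random() < sub_rate else base
            for base in oligo
        )
        reads.append(read)
    return reads


oligos = ["ACGTACGT", "TTTTCCCC", "GGGGAAAA"]
for read in simulate_reads(oligos, n_reads=6, sub_rate=0.05):
    print(read)
```

The decoder has to undo all four effects: the index handles permutation, erasure coding handles loss, and error correction plus consensus over duplicate reads handles corruption.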

SLIDE 25

Illumina sequencing (2nd gen sequencing)

  • Error rates: < 1%, mostly substitutions
  • Throughput ✅
  • Long reads, real-time, portability ❌

Nanopore sequencing (3rd gen sequencing)

  • Error rates: 10-15% (insertions, deletions, substitutions)
  • Long reads, real-time, portability ✅
  • Throughput ❌

SLIDE 26

Previous works

  • Multiple previous works focusing on:
    ○ Error correction coding
    ○ Random access of subsets of sequences using PCR primers
    ○ Scalable and cost effective synthesis techniques
    ○ Different sequencing platforms
    ○ Theoretical analysis

  • 1. Yazdi, SM Hossein Tabatabaei, et al. "A rewritable, random-access DNA-based storage system." Scientific reports 5 (2015): 14138.
  • 2. Erlich, Yaniv, and Dina Zielinski. "DNA Fountain enables a robust and efficient storage architecture." Science 355.6328 (2017): 950-954.
  • 3. Organick, Lee, et al. "Random access in large-scale DNA data storage." Nature biotechnology 36.3 (2018): 242.
  • 4. Blawat, Meinolf, et al. "Forward error correction for DNA data storage." Procedia Computer Science 80 (2016): 1011-1022.
  • 5. Church, George M., Yuan Gao, and Sriram Kosuri. "Next-generation digital information storage in DNA." Science 337.6102 (2012): 1628-1628.
  • 6. Heckel, Reinhard, et al. "Fundamental limits of DNA storage systems." 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017.
  • 7. Tomek, Kyle J., et al. "Driving the scalability of DNA-based information storage systems." ACS synthetic biology (2019).
  • 8. Lenz, Andreas, et al. "Coding over sets for DNA storage." 2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018.
  • 9. Lee, Henry H., et al. "Terminator-free template-independent enzymatic DNA synthesis for digital information storage." Nature communications 10.1 (2019): 2383.
SLIDE 31

Our contribution

  • Fundamental quantities to evaluate a DNA storage system:
    ○ Writing cost (bases synthesized/message bit)
    ○ Reading cost (bases sequenced/message bit) (not coverage)
  • Study theoretical tradeoff between writing cost and reading cost.
  • Achieve better tradeoff by reducing reliance on high coverage.
  • Break inner-outer code separation, which is theoretically suboptimal for short sequences.
  • Basecaller-decoder integration for nanopore to exploit additional information in raw current signal.
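The two cost measures can be computed directly from their definitions. The sketch below uses made-up numbers purely for illustration, and the function names are hypothetical. It also shows why reading cost differs from coverage: if every read has the full oligo length, reading cost equals coverage times writing cost.

```python
def writing_cost(n_oligos, oligo_len, message_bits):
    """Bases synthesized per message bit."""
    return n_oligos * oligo_len / message_bits


def reading_cost(read_lengths, message_bits):
    """Bases sequenced per message bit."""
    return sum(read_lengths) / message_bits


# Hypothetical experiment: 1000 oligos of 100 bases storing 160,000 bits,
# sequenced with 5000 full-length reads (coverage 5).
w = writing_cost(n_oligos=1000, oligo_len=100, message_bits=160_000)
r = reading_cost(read_lengths=[100] * 5000, message_bits=160_000)
print(w, r, r / w)
```

Two schemes with the same coverage can therefore have very different reading costs, depending on how much redundancy was synthesized.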

SLIDE 32

Illumina sequencing-based DNA storage

SLIDE 34

Key idea

  • Strategy 1: Inner/outer code separation (figure labels: Segment, Outer, Inner)
  • Strategy 2: Single large block code (LDPC) (figure labels: Segment, Code)

SLIDE 35

Experimental Results

  • Multiple parameter experiments, storing around 200 KB data each.
  • CustomArray synthesis, length 150 including primers.
  • Sequenced with Illumina iSeq.
  • Total error rate around 1.3% (substitution: 0.4%, deletion: 0.85%, insertion: 0.05%) – cheaper and noisier synthesis as compared to previous works.

  • Approach combines LDPC codes with heuristics for handling deletion errors.
SLIDE 36

Experimental Results

Figure: writing cost (bases/bit) versus reading cost (bases/bit).

  • Previous works: RS+RLL [2], Fountain+RS [1]
  • This work: Exp. 1, Exp. 2, Exp. 3, Exp. 4, Exp. 5

  • 1. Y. Erlich and D. Zielinski, "DNA Fountain enables a robust and efficient storage architecture," Science, vol. 355, no. 6328, pp. 950-954, 2017.
  • 2. L. Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology, vol. 36, no. 3, p. 242, 2018.

SLIDE 37

Nanopore sequencing-based DNA storage

SLIDE 40

Nanopore Sequencing Model

… ACGTACGTACGT ... → Nanopore sequencing channel

  • Memory (inter-symbol interference)
  • Base skips
  • Fading
  • Random symbol duration
  • Noise

VERY HARD TO MODEL AND ANALYZE FAITHFULLY

COMBINE STRENGTHS OF MACHINE LEARNING & CODING THEORY!

Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017

SLIDE 45

Our approach

Using Flappie basecaller (Oxford Nanopore)

  • Basecalling (code constraints not used): Probabilities → AACGT
  • Decoding (code constraints used): Probabilities → ACGCGT

Figure panels: basecaller probability transitions; convolutional code transitions.
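The idea of decoding with code constraints from soft information can be sketched with a small soft-input Viterbi decoder. This is a heavily simplified illustration, not the decoder from the talk. It assumes a rate-1/2 convolutional code with generators (7, 5) in octal, and it takes per-bit probabilities as a stand-in for the basecaller's soft output. The actual approach works on the basecaller's transition probabilities. All function names are hypothetical.

```python
import math


def conv_step(state, bit):
    """One step of the (7, 5) encoder; state holds the two previous inputs."""
    a, b = state >> 1, state & 1          # a: most recent input, b: older one
    out = (bit ^ a ^ b, bit ^ b)          # generators 111 and 101
    return (bit << 1) | a, out


def conv_encode(bits):
    state, coded = 0, []
    for bit in bits:
        state, out = conv_step(state, bit)
        coded.extend(out)
    return coded


def viterbi_soft(p_one, n_bits):
    """Most likely message given p_one[i] = P(coded bit i is 1)."""
    inf, eps = float("inf"), 1e-12
    cost = [0.0, inf, inf, inf]           # the encoder starts in state 0
    paths = [[], [], [], []]
    for t in range(n_bits):
        new_cost, new_paths = [inf] * 4, [None] * 4
        for state in range(4):
            if cost[state] == inf:
                continue
            for bit in (0, 1):
                nxt, out = conv_step(state, bit)
                c = cost[state]
                for k, o in enumerate(out):
                    p = min(max(p_one[2 * t + k], eps), 1 - eps)
                    c -= math.log(p if o else 1 - p)   # negative log-likelihood
                if c < new_cost[nxt]:
                    new_cost[nxt], new_paths[nxt] = c, paths[state] + [bit]
        cost, paths = new_cost, new_paths
    return paths[min(range(4), key=lambda s: cost[s])]


message = [1, 0, 1, 1, 0, 0, 1, 0]
coded = conv_encode(message)
soft = [0.9 if c else 0.1 for c in coded]
soft[3] = 1.0 - soft[3]                   # one confidently wrong coded bit
print(viterbi_soft(soft, len(message)) == message)
```

A plain basecaller would commit to the wrong value at the corrupted position. The decoder instead picks the most likely sequence among those that satisfy the code constraints, so the error is corrected.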

SLIDE 48

Preliminary Results

  • Around 3x-6x lower reading costs than state-of-the-art [1].
  • Significant fraction of sequences decoded from single read - theoretically impossible using basecalled sequence with 10-15% error.
  • Suggests that raw signal carries much more information than basecalled sequence - this can help other bioinformatics applications as well.

  • 1. L. Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology, vol. 36, no. 3, p. 242, 2018.
SLIDE 52

Conclusions and future work

  • Introduced novel coding schemes for both Illumina and nanopore based storage.
  • Plan to integrate these with random access and repeated reading.
  • Long term vision: Nanopore sequencing + cheaper and noisier synthesis techniques:
    ○ Basecaller-decoder integration works with various synthesis strategies, e.g., k-mer by k-mer
  • Core idea behind basecaller-decoder integration applicable beyond DNA storage:
    ○ Bioinformatics (soft-information based processing) - e.g., nanopolish
    ○ Communication (coding for complex and hard-to-model channels)

SLIDE 54

Team and funding

Tsachy Weissman, Mary Wootters, Hanlee Ji

Shubham Chandak, Kedar Tatwawadi, Joachim Neu, Jay Mardia, Billy Lau, Peter Griffin, Matt Kubit, Dmitri Pavlichin

Funding:

  • SemiSynBio: Highly scalable random access DNA data storage with nanopore-based reading
  • Beckman Center Innovative Technology Seed Grant: Scalable Long-Term DNA Storage with Error Correction and Random-Access Retrieval

SLIDE 55

Thank You

Poster session today 6pm-8pm: V-071

SLIDE 58

Proposed approach - schematics

Encoding: Binary file → Large block LDPC encoding → Segment and map to DNA → Add sync marker (AGT) → Attach BCH-protected index

Sequence layout: Index (~10 bp), BCH (~6 bp), Payload, AGT, Payload (~84 bp)

Decoding: Reads → Decode index using BCH → Per-index MSA & consensus → Recover partial payload using sync markers if consensus length incorrect → LDPC decoding based on counts of A/C/G/T at each position → Binary file
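The last decoding step feeds per-position base counts to the LDPC decoder. One plausible way to turn counts into soft bit information is sketched below. It assumes reads for one index are already aligned to the expected length, a substitution-only error model with a hypothetical rate, and the mapping A=00, C=01, G=10, T=11. The talk's actual computation may differ, and the function names are hypothetical.

```python
import math
from collections import Counter

BASE_TO_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def position_counts(reads, length):
    """Count A/C/G/T at each position, ignoring reads of the wrong length."""
    counts = [Counter() for _ in range(length)]
    for read in reads:
        if len(read) != length:
            continue
        for pos, base in enumerate(read):
            counts[pos][base] += 1
    return counts


def bit_llrs(counts, sub_rate=0.01):
    """Log-likelihood ratio log(P(bit=0)/P(bit=1)) for the 2 bits per base."""
    tiny = 1e-300                         # guards against division by zero
    llrs = []
    for column in counts:
        n = sum(column.values())
        # Log-likelihood of the observed column for each candidate true base.
        loglik = {
            b: column[b] * math.log(1 - sub_rate)
            + (n - column[b]) * math.log(sub_rate / 3)
            for b in "ACGT"
        }
        peak = max(loglik.values())
        weight = {b: math.exp(v - peak) for b, v in loglik.items()}
        for k in (0, 1):
            p0 = sum(w for b, w in weight.items() if BASE_TO_BITS[b][k] == 0)
            p1 = sum(w for b, w in weight.items() if BASE_TO_BITS[b][k] == 1)
            llrs.append(math.log(max(p0, tiny) / max(p1, tiny)))
    return llrs


reads = ["ACGT", "ACGT", "ACGA", "ACG"]   # the last read has a deletion
counts = position_counts(reads, length=4)
print([dict(c) for c in counts])
print([round(x, 1) for x in bit_llrs(counts)])
```

Positions where the reads disagree get log-likelihood ratios closer to zero, which tells the LDPC decoder how much to trust each bit.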

SLIDE 59

Our approach

Deep neural network (DNN) basecaller (state-of-the-art) → Soft information → Viterbi convolutional decoder → 10111 …, 10011 …, 10101 …

Using Flappie basecaller (Oxford Nanopore)

  • Basecalling (code constraints not used): Probabilities → AACGT
  • Decoding (code constraints used): Probabilities → ACGCGT

Figure panels: basecaller probability transitions; convolutional code transitions.