Error Correcting Codes for DNA based Data Storage, Shubham Chandak

Aug 16, 2023

SLIDE 1

Error Correcting Codes for DNA based Data Storage

Shubham Chandak
Stanford University
ISMB/ECCB 2019

SLIDE 2

Outline

Motivation
DNA storage setup
Illumina sequencing-based DNA storage
Nanopore sequencing-based DNA storage
Conclusions

SLIDE 3

Motivation

SLIDE 4

The amount of stored data is growing exponentially:

Source: https://www.seagate.com/our-story/data-age-2025/

SLIDE 8

200 Petabyte

HDD: 40,000 x 5 TByte HDDs, 40 tons, 10s of years
DNA: 1 gram, 1,000s of years, easy duplication

SLIDE 9

https://catalogdna.com/uncategorized/hot-news-for-the-summer-from-catalog/

SLIDE 11

How to store data in DNA sequences?

Ability to synthesize short ssDNA oligonucleotides (~150 nt) at scale.

http://www.customarrayinc.com/

SLIDE 12

How to store data in DNA sequences?

Convert binary data to A/C/G/T alphabet: e.g., 00 – A, 01 – C, etc.

Binary file: 001010101010 100001010010 100100010010 100100100010 100101001010 010101001001 010101010000
Segment:
0010101010101000010100101001
0001001010010010001010010100
1010010101001001010101010000
Convert to DNA: AGGGGGGACCAGGC ...
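The segmentation and conversion steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the slide only states 00 – A and 01 – C, so the assignments 10 – G and 11 – T are an assumption, chosen because they reproduce the slide's example output. The function names and the 28-bit segment length (read off the slide's example) are likewise illustrative.

```python
# Assumed 2-bit mapping: the slide gives 00 -> A and 01 -> C; the other two
# assignments are inferred from the slide's worked example.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def segment_bits(bits: str, segment_len: int) -> list[str]:
    """Split a bit string into equal-length segments."""
    if len(bits) % segment_len != 0:
        raise ValueError("bit string length must be a multiple of segment_len")
    return [bits[i:i + segment_len] for i in range(0, len(bits), segment_len)]


def bits_to_dna(bits: str) -> str:
    """Map each pair of bits to one base."""
    if len(bits) % 2 != 0:
        raise ValueError("need an even number of bits")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


# The binary file shown on the slide (spaces removed).
binary_file = (
    "001010101010" "100001010010" "100100010010" "100100100010"
    "100101001010" "010101001001" "010101010000"
)
segments = segment_bits(binary_file, 28)
oligos = [bits_to_dna(segment) for segment in segments]
print(oligos[0])
```

Each base carries two bits, so a 28-bit segment becomes a 14-base sequence.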

SLIDE 13

How to store data in DNA sequences?

But order of sequences lost in the solution – need to add index to each segment.

Segments with index prepended:
000010101010101000010100101001
010001001010010010001010010100
101010010101001001010101010000
Length of index in binary segment: at least log2(number of segments)
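A small sketch of the indexing step, assuming the index is a fixed-width binary counter prepended to each segment (which matches the three strings on the slide). The helper names are hypothetical.

```python
import random


def index_width(num_segments: int) -> int:
    """Smallest number of bits that can label num_segments distinct segments."""
    return max(1, (num_segments - 1).bit_length())


def add_index(segments: list[str]) -> list[str]:
    """Prepend a fixed-width binary index to every segment."""
    width = index_width(len(segments))
    return [format(i, f"0{width}b") + seg for i, seg in enumerate(segments)]


def restore_order(indexed: list[str], width: int) -> list[str]:
    """Sort by the index prefix and strip it, recovering the original order."""
    ordered = sorted(indexed, key=lambda s: int(s[:width], 2))
    return [s[width:] for s in ordered]


slide_segments = [
    "0010101010101000010100101001",
    "0001001010010010001010010100",
    "1010010101001001010101010000",
]
indexed = add_index(slide_segments)

# Simulate the loss of ordering in solution by shuffling.
shuffled = indexed[:]
random.Random(0).shuffle(shuffled)
recovered = restore_order(shuffled, index_width(len(slide_segments)))
```

With three segments the index needs two bits, so every indexed segment is 30 bits long.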

SLIDE 14

How to store data in DNA sequences?

Some sequences have zero coverage while sequencing – erasure coding + coverage.

Also used in traditional storage systems (e.g., RAID)
Figure source: https://www.usenix.org/system/files/login/articles/10_plank-online.pdf
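To illustrate the idea of erasure coding, the following sketch uses a single XOR parity segment, the simplest RAID-style scheme, which can rebuild exactly one missing segment. It is an assumption-laden toy: the slides do not say which erasure code is meant here, and practical systems typically use stronger codes that tolerate many lost segments.

```python
from typing import Optional


def xor_parity(segments: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length segments."""
    parity = bytearray(len(segments[0]))
    for seg in segments:
        if len(seg) != len(parity):
            raise ValueError("all segments must have the same length")
        for i, byte in enumerate(seg):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received: list[Optional[bytes]], parity: bytes) -> list[bytes]:
    """Rebuild at most one erased segment (marked None) from the parity."""
    missing = [i for i, seg in enumerate(received) if seg is None]
    if len(missing) > 1:
        raise ValueError("a single parity segment can repair only one erasure")
    repaired = list(received)
    if missing:
        present = [seg for seg in received if seg is not None]
        # XOR of all surviving segments and the parity equals the lost segment.
        repaired[missing[0]] = xor_parity(present + [parity])
    return repaired


data = [b"DNA!", b"data", b"code"]
parity = xor_parity(data)
# Segment 1 had zero coverage during sequencing.
restored = recover_missing([b"DNA!", None, b"code"], parity)
```

An erasure differs from an error in that the decoder knows which segment is missing, which is why one parity segment suffices here.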

SLIDE 15

How to store data in DNA sequences?

Sequencing and synthesis cause errors – substitutions, insertions and deletions – error correction coding + coverage.

Data bits: 0100010011
Encode, data + parity bits: 01000100111011
Bitflip: 01000101111011
Decoding: 01000100111011
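The encode, bit flip, decode cycle can be demonstrated with a Hamming(7,4) code, which adds three parity bits to four data bits and corrects any single flipped bit. This is a stand-in chosen for brevity: the slide's example uses 10 data bits and 4 parity bits, and the deck does not name the code behind it.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits into 7 bits, laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: list[int]) -> list[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


message = [1, 0, 1, 1]
sent = hamming74_encode(message)
received = list(sent)
received[4] ^= 1  # a single bit flip in transit
decoded = hamming74_decode(received)
```

The syndrome bits spell out the position of the flipped bit in binary, so correction is a single XOR.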

SLIDE 16

How to store data in DNA sequences?

Error correction studied extensively for communication and traditional data storage systems – information theory and coding theory.

SLIDE 17

How to store data in DNA sequences?

Error/Erasure Correcting Codes enable reliable data recovery even for noisy, low cost synthesis and sequencing – likely to be the future of DNA storage.

SLIDE 18

DNA storage setup

SLIDE 24

Typical DNA Storage System

File -> Segmentation -> Outer code -> Inner code -> Synthesis -> Storage -> Sequencing + Basecalling -> Sequenced reads -> Decoding -> Reconstructed file

Sequenced reads are subject to:

Duplication
Permutation
Loss
Corruption

SLIDE 25

Illumina sequencing (2nd gen sequencing)

Error rates: < 1%, mostly substitutions
Throughput: ✅
Long reads: ❌
Real-time: ❌
Portability: ❌

Nanopore sequencing (3rd gen sequencing)

Error rates: 10-15%, insertions, deletions, substitutions
Throughput: ❌
Long reads: ✅
Real-time: ✅
Portability: ✅

SLIDE 26

Previous works

Multiple previous works focusing on:

○ Error correction coding
○ Random access of subsets of sequences using PCR primers
○ Scalable and cost effective synthesis techniques
○ Different sequencing platforms
○ Theoretical analysis

1. Yazdi, SM Hossein Tabatabaei, et al. "A rewritable, random-access DNA-based storage system." Scientific reports 5 (2015): 14138.
2. Erlich, Yaniv, and Dina Zielinski. "DNA Fountain enables a robust and efficient storage architecture." Science 355.6328 (2017): 950-954.
3. Organick, Lee, et al. "Random access in large-scale DNA data storage." Nature biotechnology 36.3 (2018): 242.
4. Blawat, Meinolf, et al. "Forward error correction for DNA data storage." Procedia Computer Science 80 (2016): 1011-1022.
5. Church, George M., Yuan Gao, and Sriram Kosuri. "Next-generation digital information storage in DNA." Science 337.6102 (2012): 1628-1628.
6. Heckel, Reinhard, et al. "Fundamental limits of DNA storage systems." 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017.
7. Tomek, Kyle J., et al. "Driving the scalability of DNA-based information storage systems." ACS synthetic biology (2019).
8. Lenz, Andreas, et al. "Coding over sets for DNA storage." 2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018.
9. Lee, Henry H., et al. "Terminator-free template-independent enzymatic DNA synthesis for digital information storage." Nature communications 10.1 (2019): 2383.

SLIDE 31

Our contribution

Fundamental quantities to evaluate a DNA storage system:

○ Writing cost (bases synthesized/message bit)
○ Reading cost (bases sequenced/message bit) (not coverage)

Study theoretical tradeoff between writing cost and reading cost.
Achieve better tradeoff by reducing reliance on high coverage.
Break inner-outer code separation, which is theoretically suboptimal for short sequences.
Basecaller-decoder integration for nanopore to exploit additional information in raw current signal.
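The two cost measures are simple ratios, and a short calculation shows how they relate to coverage. All numbers below are hypothetical and chosen only for round arithmetic; whether primer bases count toward the totals is not stated on the slide, so the sketch counts whatever base totals it is given.

```python
def cost_per_bit(total_bases: int, message_bits: int) -> float:
    """Bases spent per message bit (used for both writing and reading cost)."""
    return total_bases / message_bits


# Hypothetical experiment: 200 KB message, 12,000 oligos of 100 bases each.
message_bits = 200_000 * 8
num_oligos = 12_000
oligo_len = 100

bases_synthesized = num_oligos * oligo_len
writing_cost = cost_per_bit(bases_synthesized, message_bits)

# Hypothetical sequencing run: on average 5 reads per oligo (coverage 5).
coverage = 5
bases_sequenced = coverage * bases_synthesized
reading_cost = cost_per_bit(bases_sequenced, message_bits)

print(f"writing cost: {writing_cost} bases/bit")
print(f"reading cost: {reading_cost} bases/bit")
```

When reads and oligos have the same length, reading cost equals coverage times writing cost. This is one way to see why coverage alone is not the right yardstick: a scheme with low coverage but a high writing cost can still be expensive to read. Since one base carries at most two bits, writing cost cannot drop below 0.5 bases/bit.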

SLIDE 32

Illumina sequencing-based DNA storage

SLIDE 34

Key idea

Strategy 1: Inner/outer code separation (figure labels: Segment, Outer, Inner)
Strategy 2: Single large block code (LDPC) (figure labels: Segment, Code)

SLIDE 35

Experimental Results

Multiple parameter experiments, storing around 200 KB data each.
CustomArray synthesis, length 150 including primers.
Sequenced with Illumina iSeq.
Total error rate around 1.3% (substitution: 0.4%, deletion: 0.85%, insertion: 0.05%) – cheaper and noisier synthesis as compared to previous works.
Approach combines LDPC codes with heuristics for handling deletion errors.
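To get a feel for these error rates, the sketch below passes a sequence through a toy channel with the reported substitution, deletion and insertion probabilities. It assumes errors are independent per base and that at most one event happens at each position, which is a modelling simplification and not a claim about the real synthesis and sequencing process.

```python
import random

BASES = "ACGT"


def noisy_copy(seq, p_sub=0.004, p_del=0.0085, p_ins=0.0005, rng=None):
    """Return a copy of seq with random substitutions, deletions, insertions."""
    rng = rng or random.Random()
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop this base
        if r < p_del + p_ins:
            out.append(rng.choice(BASES))  # insertion: extra base before this one
            out.append(base)
        elif r < p_del + p_ins + p_sub:
            # substitution: replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(2019)
oligo = "".join(rng.choice(BASES) for _ in range(150))
reads = [noisy_copy(oligo, rng=rng) for _ in range(1000)]
wrong_length = sum(len(read) != len(oligo) for read in reads)
print(f"{wrong_length} of {len(reads)} reads have the wrong length")
```

Deletions and insertions change the read length, which is why they are harder to handle than substitutions and why the approach needs dedicated heuristics for them.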

SLIDE 36

Experimental Results

[Figure: writing cost (bases/bit) and reading cost (bases/bit) for previous works (RS+RLL [2], Fountain+RS [1]) and this work (Exp. 1 to Exp. 5).]

1. Y. Erlich and D. Zielinski, "DNA Fountain enables a robust and efficient storage architecture," Science, vol. 355, no. 6328, pp. 950-954, 2017.
2. L. Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology, vol. 36, no. 3, p. 242, 2018.

SLIDE 37

Nanopore sequencing-based DNA storage

SLIDE 40

Nanopore Sequencing Model

… ACGTACGTACGT ... -> Nanopore sequencing channel

Memory (inter-symbol interference)
Base skips
Fading
Random symbol duration
Noise

VERY HARD TO MODEL AND ANALYZE FAITHFULLY
COMBINE STRENGTHS OF MACHINE LEARNING & CODING THEORY!

Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017

SLIDE 45

Our approach

Using Flappie basecaller (Oxford Nanopore)

Basecalling (code constraints not used): probabilities -> AACGT
Decoding (code constraints used): probabilities -> ACGCGT

Figure labels: basecaller probability transitions, convolutional code transitions
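The benefit of decoding from probabilities with code constraints can be sketched with a soft-input Viterbi decoder for a small convolutional code. This is a heavily simplified stand-in, not the method from the talk: it assumes one probability per coded bit, already aligned to the codeword, whereas a real basecaller emits probabilities over a variable-length signal with insertions and deletions. The rate-1/2 code with generators (7, 5) in octal is an illustrative choice.

```python
import math

GENERATORS = (0b111, 0b101)  # illustrative (7, 5) code, constraint length 3


def parity(x: int) -> int:
    return bin(x).count("1") % 2


def conv_encode(bits):
    """Rate-1/2 encoder; the state holds the two previous input bits."""
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state
        out.extend(parity(reg & g) for g in GENERATORS)
        state = reg >> 1
    return out


def viterbi_decode(p_one, n_bits):
    """Most likely message given p_one[i] = P(coded bit i is 1)."""
    neg_inf = float("-inf")
    score = [0.0, neg_inf, neg_inf, neg_inf]  # encoder starts in state 0
    paths = [[], [], [], []]
    for t in range(n_bits):
        new_score = [neg_inf] * 4
        new_paths = [None] * 4
        for state in range(4):
            if score[state] == neg_inf:
                continue
            for b in (0, 1):
                reg = (b << 2) | state
                metric = score[state]
                for j, g in enumerate(GENERATORS):
                    p = p_one[2 * t + j]
                    expected = parity(reg & g)
                    # log-probability that the channel bit matches this branch
                    metric += math.log(max(p if expected else 1 - p, 1e-12))
                nxt = reg >> 1
                if metric > new_score[nxt]:
                    new_score[nxt] = metric
                    new_paths[nxt] = paths[state] + [b]
        score, paths = new_score, new_paths
    best = max(range(4), key=lambda s: score[s])
    return paths[best]


message = [1, 0, 1, 1, 0, 0, 1, 0]
coded = conv_encode(message)
soft = [0.9 if c else 0.1 for c in coded]
soft[3] = 1 - soft[3]  # one coded bit is confidently wrong
decoded = viterbi_decode(soft, len(message))
```

A hard decision on `soft` alone would get bit 3 wrong; the decoder recovers the message because only sequences allowed by the code are considered.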

SLIDE 48

Preliminary Results

Around 3x-6x lower reading costs than state-of-the-art [1].
Significant fraction of sequences decoded from single read - theoretically impossible using basecalled sequence with 10-15% error.
Suggests that raw signal carries much more information than basecalled sequence - this can help other bioinformatics applications as well.

1. L. Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology, vol. 36, no. 3, p. 242, 2018.

SLIDE 52

Conclusions and future work

Introduced novel coding schemes for both Illumina and nanopore based storage.
Plan to integrate these with random access and repeated reading.
Long term vision: Nanopore sequencing + cheaper and noisier synthesis techniques:

○ Basecaller-decoder integration works with various synthesis strategies, e.g., k-mer by k-mer

Core idea behind basecaller-decoder integration applicable beyond DNA storage:

○ Bioinformatics (soft-information based processing) - e.g., nanopolish
○ Communication (coding for complex and hard-to-model channels)

SLIDE 54

Team and funding

Tsachy Weissman, Mary Wootters, Hanlee Ji

Shubham Chandak, Kedar Tatwawadi, Joachim Neu, Jay Mardia, Billy Lau, Peter Griffin, Matt Kubit, Dmitri Pavlichin

Funding:
SemiSynBio: Highly scalable random access DNA data storage with nanopore-based reading
Beckman Center Innovative Technology Seed Grant: Scalable Long-Term DNA Storage with Error Correction and Random-Access Retrieval

SLIDE 55

Thank You

Poster session today 6pm-8pm: V-071

SLIDE 58

Proposed approach - schematics

Encoding

1. Binary file
2. Large block LDPC encoding
3. Segment and map to DNA
4. Add sync marker (AGT)
5. Attach BCH-protected index

Oligo layout (figure labels): Index (~10 bp), BCH (~6 bp), Payload / AGT / Payload (~84 bp)

Decoding

1. Reads
2. Decode index using BCH
3. Per-index MSA & consensus
4. Recover partial payload using sync markers if consensus length incorrect
5. LDPC decoding based on counts of A/C/G/T at each position
6. Binary file
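The per-index grouping and per-position base counts in the decoding pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: the index is taken to be a fixed-length prefix that has already been error-corrected, and reads whose payload has the wrong length are simply skipped, where the slide describes multiple sequence alignment and sync-marker recovery. The function names and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def group_by_index(reads, index_len):
    """Group payloads by their index prefix (index assumed already corrected)."""
    groups = defaultdict(list)
    for read in reads:
        groups[read[:index_len]].append(read[index_len:])
    return groups


def position_counts(payloads, length):
    """Count A/C/G/T at each position, using only payloads of the right length."""
    counts = [Counter() for _ in range(length)]
    for payload in payloads:
        if len(payload) != length:
            continue  # simplification: no alignment of indel-affected reads
        for i, base in enumerate(payload):
            counts[i][base] += 1
    return counts


def consensus(counts):
    """Majority base per position; 'N' where no read contributed."""
    return "".join(c.most_common(1)[0][0] if c else "N" for c in counts)


# Toy reads: a 2-base index followed by a 4-base payload.
reads = ["AAACGT", "AAACGA", "AAACGT", "CCGGTT", "AAAC"]
groups = group_by_index(reads, index_len=2)
counts_aa = position_counts(groups["AA"], length=4)
print(consensus(counts_aa))
```

The counts carry more information than the consensus alone: a position with counts T:2, A:1 is less certain than one with T:3, and a soft-input decoder such as LDPC can make use of that difference.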

SLIDE 59

Our approach

Using Flappie basecaller (Oxford Nanopore)

Figure labels: Deep neural network (DNN) basecaller (state-of-the-art) -> soft information -> Viterbi convolutional decoder -> 10111 … 10011 … 10101 …

Basecalling (code constraints not used): probabilities -> AACGT
Decoding (code constraints used): probabilities -> ACGCGT

Figure labels: basecaller probability transitions, convolutional code transitions