Error Correcting Codes for DNA based Data Storage
Shubham Chandak Stanford University ISMB/ECCB 2019
Outline
○ Motivation
○ DNA storage setup
○ Illumina sequencing-based DNA storage
○ Nanopore sequencing-based DNA storage
The amount of stored data is growing exponentially:
Source: https://www.seagate.com/our-story/data-age-2025/
40,000 x 5 TByte HDDs: 40 tons, 10s of years
DNA: 1 gram, 1,000s of years, easy duplication
https://catalogdna.com/uncategorized/hot-news-for-the-summer-from-catalog/
How to store data in DNA sequences?
http://www.customarrayinc.com/
Binary file: 001010101010 100001010010 100100010010 100100100010 100101001010 010101001001 010101010000
Segment:
0010101010101000010100101001
0001001010010010001010010100
1010010101001001010101010000
Convert to DNA: AGGGGGGACCAGGC . .
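The segmentation and conversion step can be sketched as follows. This is a minimal illustration, not the encoder used in the talk. The pairs 00→A, 01→C and 10→G are consistent with the slide's example (the first segment converts to AGGGGGGACCAGGC); 11→T is an assumption, as are the function names and the zero-padding of the last segment.

```python
from typing import List

# 00->A, 01->C, 10->G match the slide's example; 11->T is an assumption.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def segment(bits: str, seg_len: int) -> List[str]:
    """Split a bit string into fixed-length segments, zero-padding the last one."""
    if seg_len <= 0 or seg_len % 2:
        raise ValueError("seg_len must be a positive even number")
    padded = bits + "0" * (-len(bits) % seg_len)
    return [padded[i:i + seg_len] for i in range(0, len(padded), seg_len)]


def bits_to_dna(bits: str) -> str:
    """Map each pair of bits to one base (2 bits per base)."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Inverse of bits_to_dna."""
    return "".join(BASE_TO_BITS[base] for base in dna)
```

A fixed 2-bits-per-base mapping is the simplest choice; it places no constraint on homopolymer runs or GC content, which practical schemes may need to handle.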
Segments with index prepended:
000010101010101000010100101001
010001001010010010001010010100
101010010101001001010101010000
Length of index in binary segment: at least log2(number of segments)
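The index-length rule can be checked with a short sketch. The function names are hypothetical; the only logic taken from the slide is that the index needs at least log2(number of segments) bits and is prepended to each segment.

```python
from typing import List


def index_bits(num_segments: int) -> int:
    """Smallest index width (in bits) that gives every segment a distinct index."""
    if num_segments < 1:
        raise ValueError("need at least one segment")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2; use 1 bit when n == 1.
    return max(1, (num_segments - 1).bit_length())


def add_index(segments: List[str]) -> List[str]:
    """Prepend a fixed-width binary index to each segment."""
    width = index_bits(len(segments))
    return [format(i, "0{}b".format(width)) + seg for i, seg in enumerate(segments)]
```

Applied to the three 28-bit segments from the slide, this gives a 2-bit index and reproduces the three 30-bit indexed segments shown above.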
Figure source: https://www.usenix.org/system/files/login/articles/10_plank-online.pdf Also used in traditional storage systems (e.g., RAID)
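The RAID-style idea of recovering a lost segment from redundancy can be illustrated with a single XOR parity segment. This is a deliberately minimal stand-in chosen for illustration; it tolerates only one missing segment and is not the outer code used in the talk. All names are hypothetical.

```python
from typing import List, Optional


def xor_parity(segments: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length segments."""
    length = len(segments[0])
    if any(len(seg) != length for seg in segments):
        raise ValueError("all segments must have the same length")
    parity = bytearray(length)
    for seg in segments:
        for i, byte in enumerate(seg):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Fill in at most one missing segment (marked None) using the parity segment."""
    missing = [i for i, seg in enumerate(received) if seg is None]
    if len(missing) > 1:
        raise ValueError("a single parity segment can repair only one erasure")
    restored = list(received)
    if missing:
        present = [seg for seg in received if seg is not None]
        # XOR of all surviving segments and the parity equals the lost segment.
        restored[missing[0]] = xor_parity(present + [parity])
    return restored
```

Codes that tolerate several missing segments (for example Reed-Solomon) follow the same principle with more parity segments.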
error correction coding+coverage.
Data bits: 0100010011
Encode → data+parity bits: 01000100111011
Bitflip → 01000101111011
Decoding → 01000100111011
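The encode, bitflip, decode cycle can be reproduced with a small textbook code. The sketch below uses the Hamming(7,4) code, which is an assumption made for illustration: the slide does not say which code produced its 10-bit data and 4-bit parity example, and the function names are hypothetical.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into 7 bits, laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(received: List[int]) -> List[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(received)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3  # 1-based position of the flip, 0 if none
    if error_pos:
        c[error_pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Flipping any single bit of an encoded word and decoding returns the original data bits; two or more flips in the same 7-bit word are beyond what this code can correct.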
storage systems – information theory and coding theory.
Error/Erasure Correcting Codes enable reliable data recovery even for noisy, low cost synthesis and sequencing – likely to be the future of DNA storage.
Typical DNA Storage System
File → Segmentation → Outer code → Inner code → Synthesis → Storage → Sequencing + Basecalling → Sequenced reads → Decoding → Reconstructed file
               Illumina sequencing (2nd gen)    Nanopore sequencing (3rd gen)
Error rates    < 1%, mostly substitutions       10 - 15%, insertions, deletions, substitutions
Throughput     ✅                               ❌
Long reads     ❌                               ✅
Real-time      ❌                               ✅
Portability    ❌                               ✅
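The difference between the two error profiles can be explored with a toy channel simulator. This is a rough sketch under strong assumptions: errors are independent per base, and the way an overall error rate is split into substitutions, insertions and deletions is hypothetical (the slides give only the overall rates and the dominant error types). Real sequencers, nanopore in particular, have far more structured errors.

```python
import random


def noisy_read(seq: str, p_sub: float, p_ins: float, p_del: float,
               rng: random.Random) -> str:
    """Pass a DNA string through an i.i.d. substitution/insertion/deletion channel."""
    if min(p_sub, p_ins, p_del) < 0 or p_sub + p_ins + p_del > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            continue  # deletion: the base is dropped
        if r < p_del + p_ins:
            out.append(rng.choice("ACGT"))  # insertion: random base before the original
            out.append(base)
        elif r < p_del + p_ins + p_sub:
            out.append(rng.choice([b for b in "ACGT" if b != base]))  # substitution
        else:
            out.append(base)
    return "".join(out)
```

For example, `noisy_read(seq, 0.01, 0.0, 0.0, random.Random(1))` gives a substitution-only read at a 1% rate, while `noisy_read(seq, 0.04, 0.04, 0.04, random.Random(1))` gives a read with all three error types at a 12% total rate; the equal three-way split is an assumption.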
Previous works
○ Error correction coding
○ Random access of subsets of sequences using PCR primers
○ Scalable and cost effective synthesis techniques
○ Different sequencing platforms
○ Theoretical analysis
Our contribution
○ Writing cost (bases synthesized/message bit)
○ Reading cost (bases sequenced/message bit) (not coverage)
short sequences.
information in raw current signal.
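The two cost metrics, writing cost (bases synthesized/message bit) and reading cost (bases sequenced/message bit), can be computed as in the sketch below, which also shows why reading cost differs from coverage. The function names and the numbers in the usage note are hypothetical, not results from the talk.

```python
def writing_cost(num_oligos: int, oligo_len: int, message_bits: int) -> float:
    """Bases synthesized per message bit."""
    return num_oligos * oligo_len / message_bits


def reading_cost(num_reads: int, read_len: int, message_bits: int) -> float:
    """Bases sequenced per message bit."""
    return num_reads * read_len / message_bits


def coverage(num_reads: int, num_oligos: int) -> float:
    """Average number of reads per synthesized oligo."""
    return num_reads / num_oligos
```

With 1,000 oligos of 100 bases storing 160,000 message bits, the writing cost is 0.625 bases/bit. Sequencing 5,000 reads of 100 bases gives a coverage of 5 and a reading cost of 3.125 bases/bit. When reads and oligos have the same length, reading cost equals coverage times writing cost, so a scheme with more redundancy can need lower coverage yet the same reading cost; that is one reason to report reading cost rather than coverage.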
Key idea
Strategy 1: Inner/outer code separation (figure labels: Segment, Outer, Inner)
Strategy 2: Single large block code (LDPC) (figure labels: Segment, Code)
Experimental Results
0.05%) – cheaper and noisier synthesis as compared to previous works.
[Figure: reading cost (bases/bit) versus writing cost (bases/bit), comparing this work with previous works: Fountain+RS [1] and RS+RLL [2].]
Nanopore Sequencing Model
… ACGTACGTACGT ... → Nanopore sequencing channel
VERY HARD TO MODEL AND ANALYZE FAITHFULLY
COMBINE STRENGTHS OF MACHINE LEARNING & CODING THEORY!
Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017
Our approach
Basecalling (using Flappie basecaller, Oxford Nanopore): code constraints not used; output is probabilities. Example: AACGT
Decoding: code constraints used. Example: ACGCGT
Figure: basecaller probability transitions and convolutional code transitions
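The idea of decoding from probabilities rather than from a hard basecalled sequence can be illustrated with a soft-decision Viterbi decoder for a small convolutional code. This is a simplified sketch, not the decoder from the talk: it assumes one independent probability per coded bit, a rate-1/2 code with generators (7, 5) in octal, and no insertions or deletions. Flappie's actual output format and the code used in the experiments are not reproduced here, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

K = 3                    # constraint length (assumed)
G = (0b111, 0b101)       # generator polynomials (7, 5) in octal (assumed)
N_STATES = 1 << (K - 1)


def _step(state: int, bit: int) -> Tuple[int, Tuple[int, ...]]:
    """Return (next_state, output bits) for one input bit."""
    reg = (bit << (K - 1)) | state
    out = tuple(bin(reg & g).count("1") & 1 for g in G)
    return reg >> 1, out


def conv_encode(bits: Sequence[int]) -> List[int]:
    """Encode bits, appending K-1 zero tail bits so the encoder ends in state 0."""
    state, coded = 0, []
    for b in list(bits) + [0] * (K - 1):
        state, out = _step(state, b)
        coded.extend(out)
    return coded


def viterbi_decode(p_one: Sequence[float], eps: float = 1e-9) -> List[int]:
    """Most likely message given p_one[i] = P(coded bit i is 1)."""
    n_steps = len(p_one) // len(G)
    metric = [0.0] + [-math.inf] * (N_STATES - 1)  # encoder starts in state 0
    history = []
    for t in range(n_steps):
        probs = p_one[t * len(G):(t + 1) * len(G)]
        new_metric = [-math.inf] * N_STATES
        back = [None] * N_STATES
        for s in range(N_STATES):
            if metric[s] == -math.inf:
                continue
            for b in (0, 1):
                ns, out = _step(s, b)
                # Log-likelihood of this branch under the soft information.
                m = metric[s] + sum(
                    math.log(max(eps, p if o else 1.0 - p))
                    for o, p in zip(out, probs)
                )
                if m > new_metric[ns]:
                    new_metric[ns] = m
                    back[ns] = (s, b)
        history.append(back)
        metric = new_metric
    state, bits = 0, []  # tail bits force the final state to be 0
    for back in reversed(history):
        state, b = back[state]
        bits.append(b)
    bits.reverse()
    return bits[:max(0, len(bits) - (K - 1))]
```

Because the branch metric uses probabilities, a coded bit the basecaller is unsure about (say 0.3 versus 0.7) costs much less than a confident wrong call, so the code constraints can override it.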
Preliminary Results
impossible using basecalled sequence with 10-15% error.
sequence - this can help other bioinformatics applications as well.
Conclusions and future work
techniques:
○ Basecaller-decoder integration works with various synthesis strategies, e.g., k-mer by k-mer
○ Bioinformatics (soft-information based processing) - e.g., nanopolish
○ Communication (coding for complex and hard-to-model channels)
Team and funding
Tsachy Weissman, Mary Wootters, Hanlee Ji
Shubham Chandak, Kedar Tatwawadi, Joachim Neu, Jay Mardia, Billy Lau, Peter Griffin, Matt Kubit, Dmitri Pavlichin
SemiSynBio: Highly scalable random access DNA data storage with nanopore-based reading
Beckman Center Innovative Technology Seed Grant: Scalable Long-Term DNA Storage with Error Correction and Random-Access Retrieval
Poster session today 6pm-8pm: V-071
Proposed approach - schematics
Encoding: Binary file → Large block LDPC encoding → Segment and map to DNA → Add sync marker (AGT) → Attach BCH-protected index
Oligo layout: Index (~ 10 bp) | BCH (~ 6 bp) | Payload, AGT, Payload (~ 84 bp)
Decoding: Reads → Decode index using BCH → Per-index MSA & consensus → Recover partial payload using sync markers if consensus length incorrect → LDPC decoding based on counts of A/C/G/T at each position → Binary file
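The last decoding step, LDPC decoding based on counts of A/C/G/T at each position, needs the counts turned into per-bit soft information. The sketch below shows one way this could be done. It assumes reads that are already aligned to a common length, a symmetric substitution-only error model with a hypothetical error rate eps, and the 2-bits-per-base mapping A=00, C=01, G=10, T=11. None of these details is confirmed by the slides, and the LDPC decoder itself is not shown.

```python
import math
from collections import Counter
from typing import List, Sequence

# Assumed 2-bits-per-base mapping.
BASE_TO_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def base_counts(reads: Sequence[str]) -> List[Counter]:
    """Count the bases observed at each position of equal-length aligned reads."""
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be aligned to the same length")
    return [Counter(column) for column in zip(*reads)]


def _logsumexp(values: Sequence[float]) -> float:
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def bit_llrs(counts: Sequence[Counter], eps: float = 0.05) -> List[float]:
    """Per-bit log-likelihood ratios log P(bit=0)/P(bit=1), two bits per base."""
    llrs = []
    for c in counts:
        n = sum(c[b] for b in "ACGT")
        # Log-likelihood of the observed counts for each candidate true base.
        loglik = {
            b: c[b] * math.log(1 - eps) + (n - c[b]) * math.log(eps / 3)
            for b in "ACGT"
        }
        for k in (0, 1):
            zero = _logsumexp([loglik[b] for b in "ACGT" if BASE_TO_BITS[b][k] == 0])
            one = _logsumexp([loglik[b] for b in "ACGT" if BASE_TO_BITS[b][k] == 1])
            llrs.append(zero - one)
    return llrs
```

Positions where the reads agree produce large-magnitude LLRs and positions with disagreement produce smaller ones, which is the kind of reliability information a soft-input LDPC decoder can use.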
Deep neural network (DNN) basecaller (state-of-the-art) → soft information → Viterbi convolutional decoder → 10111 … 10011 … 10101 …