Coding for DNA Storage in Live Organisms Moshe Schwartz Electrical - PowerPoint PPT Presentation

Coding for DNA Storage in Live Organisms Moshe Schwartz Electrical & Computer Engineering Ben-Gurion University Israel

Based on joint works with: (alphabetically) • Jehoshua Bruck – Caltech • Ohad Elishco – Ben-Gurion University (now MIT) • Farzad Farnoud (Hassanzadeh) – University of Virginia • Siddharth Jain – Caltech • Yonatan Yehezkeally – Ben-Gurion University Introduction 2 / 79

Science fiction distant future dream? Introduction 3 / 79

No – It’s just around the corner! Introduction 4 / 79

DNA is a long string Genetic information is stored in DNA, which is a In E. coli bacteria, genetic information is stored in In humans, genetic information is stored in over Introduction 5 / 79 string of nucleotides: Adenine, Cytosine, Guanine, and Thymine. about 4 · 10 6 base pairs. 3 · 10 9 base pairs.

Why store information in DNA? DNA is dense! It stores information in the molecular level. Introduction 6 / 79 DNA can potentially hold 250 · 2 50 bytes ( 250 peta-byte) of information in 1 gram of DNA. If we were to use 8 Tb hard-drives to store the same amount, we’ll need 32000 hard-drives, with a total weight of about 25 tons!

OK, but why in living organisms? Living organisms replicate and solve this problem. organisms. Main disadvantage: Mutations! We need error-correcting codes. Introduction 7 / 79 • Reading from DNA is destructive, hence we need several copies. • Data longevity is (potentially) better, due to replication of • The organism’s outer shell provides extra protection. • Labeling organisms for biological studies. • Watermarking genetically modified organisms (GMOs).

Error-correcting codes – An age old story An error-correcting code has two main components: 1 An error ball: Its size and shape depend on the kind of errors the channel induces. 2 A packing of error balls: Its density affects communication efficiency. Its structure affects ease of encoding/decoding. Introduction 8 / 79

What kinds of errors do we expect? u Introduction Which is the most common? Unknown yet, but… w v v u w u w v u w w Insertion v u w v u w u w v u Deletion Substitution Duplication 9 / 79 v ′

Repeated sequences are everywhere myotonic dystrophy, Huntington’s disease, and important phenomena such as chromosome fragility, expansion diseases, silencing genes, and rapid morphological variation. Repetitions are common in other species as well, and are claimed to be 1 Lander et al. , Nature 2001. Introduction 10 / 79 More than 50 % of human genome is repeated sequences! 1 Repetitions were shown to be connected with diseases such as cancer, a major evolutionary force during vertebrate evolution. 1

Duplication processes may repeat ACTCA ACTACTCA ACTATACTCA ACTATACACTCA It is conceivable that a substantial portion of the unique genome, the part that is not known to contain repeated sequences, also has its origins in ancient repeated sequences that are no longer recognizable due to change over time. 2 2 Lander et al. , Nature 2001. Introduction 11 / 79 ⇓ ⇓ ⇓

Duplication processes may differ v v u v v R w u v v w u v w v z Introduction w u Palindromic Duplication u Interspersed Duplication End Duplication Tandem Duplication u v w v z w u v w u v w 12 / 79

A formal definition Definition Introduction 13 / 79 Let Σ be a finite alphabet, s ∈ Σ ∗ some string, and T ⊆ Σ ∗ Σ ∗ a set of string-duplication rules. A string-duplication system, S , defined by the tuple (Σ , s , T ) , is the reflexive transitive closure of T operating on s , namely, S ⊆ Σ ∗ is the minimal set for which: 1 s ∈ S . 2 s ′ ∈ S and T ∈ T imply T ( s ′ ) ∈ S . We write S = S (Σ , s , T ) .

End duplication - formally Definition (End Duplication) Introduction v w v u w v u k k 14 / 79 T end otherwise. T end uvwv x k { if x = uvw , | u | = i , | v | = k i , k ( x ) = { � } � i � 0 T end = . � i , k = S (Σ , s , T end ) . The end-duplication system is defined as S end

Tandem duplication - formally Definition (Tandem Duplication) Introduction w v v u w v u k k 15 / 79 T tan otherwise. T tan uvvw x k { if x = uvw , | u | = i , | v | = k i , k ( x ) = { � } � i � 0 T tan = . � i , k = S (Σ , s , T tan ) . The tandem-duplication system is defined as S tan

How expressive is a duplication system? Definition n Definition Introduction 16 / 79 The capacity of a string system S ⊆ Σ ∗ is defined by log 2 | S ∩ Σ n | cap ( S ) = lim sup . n →∞ Let S ⊆ Σ ∗ be a string system. We shall say S is fully expressive if for every v ∈ Σ ∗ there exist u , w ∈ Σ ∗ such that uvw ∈ S .

We are interested in: Introduction 17 / 79 • How does the capacity depend on the choice of duplication rules? • How does the capacity depend on the choice of seed string? • Which systems are fully expressive? • What is the connection between capacity and full expressiveness?

Some related previous work exists Tandem duplication was studied in the context of formal languages: Where are tandem-duplication languages located in the Chomsky hierarchy? Binary tandem-duplication languages are regular. Non-binary tandem-duplication languages are irregular. Introduction 18 / 79 • Martín-Vide and Paun, Acta Cybernetica (1999): • Dassow, Mitrana and Paun, Bull. of the EATCS (1999): • Ming-Wei, Bull. of the EATCS (2000):

More related previous work exists Tandem duplication was studied in an algorithmic context: Systems Sci. (2004): How to efficiently find tandem duplications in a string. How to efficiently find nested tandem duplications. J. Comp. Biology (2007), Brejová et al., Phil. Trans. R. Soc. A (2014): How to reconstruct the derivation process of a tandem-duplicated string. Introduction 19 / 79 • Main and Lorentz, J. Alg. (1984), Gusfield and Stoye, J. Comp. and • Matroud, Hendy, and Tuffley, Nucleic Acids Research (2011): • Elemento et al., Molecular Bio. and Evolution (2002), Lajoie et al.,

End duplication has full capacity Theorem For S end k k k Assumption End Duplication 20 / 79 ) , | s | � k, = S (Σ , s , T end cap ( S end ) = log 2 | Σ | . The initial string s contains every symbols of Σ at least once.

End duplication has full capacity (Cont.) Proof. End Duplication n n k 21 / 79 k � : We obviously have, � ∩ Σ n � � S end � log 2 cap ( S end ) = lim sup n →∞ log 2 | Σ n | � lim sup n →∞ = log 2 | Σ | .

End duplication has full capacity (Cont.) Proof. string y with w as a suffix. k End Duplication 22 / 79 � : We claim that starting with any string s ∈ Σ � k , with each symbol appearing at least once, and any w = w 1 w 2 . . . w k ∈ Σ k , we can derive a Step I: Duplicate prefix. Assume s = uv , | u | = k , then s = uv ⇒ uvu = s ′ . Observation: Every symbol of Σ appears in the beginning and end of a k -substring of s ′ . Step II: Force w 1 at the end. ⇒ w 1 w 1

End duplication has full capacity (Cont.) Proof. End Duplication k and then 23 / 79 k Step III: Force w 1 w 2 at the end. ⇒ w 2 w 1 w 1 w 2 ⇒ w 1 w 2 w 1 w 2 Repeat Step III inductively to get w 1 w 2 . . . w k as a suffix.

End duplication has full capacity (Cont.) Proof. End Duplication systems are fully expressive. k S end Corollary n k 24 / 79 k substring. Step IV: Repeat previous steps to get every k -word from Σ k as a Thus, after at most 2 k | Σ | k duplications we get a string s ′′ containing all possible k -substrings, | s ′′ | � 2 k 2 | Σ | k . For any n = | s ′′ | + tk we can now create | Σ | tk distinct strings. Hence, � ∩ Σ n � log 2 ( | Σ | tk ) � S end � log 2 � lim sup � log 2 | Σ | . cap ( S end ) = lim sup | s ′′ | + tk n →∞ t →∞

Tandem duplication behaves differently But first… Tandem Duplication otherwise, q , 25 / 79 q Definition Main tool – φ k -transform domain. We assume WLOG that Σ = Z q . We define the transform φ k : Z � k q × Z ∗ → Z k q by, φ k ( x ) = ( Pref k ( x ) , Suff | x |− k ( x ) − Pref | x |− k ( x )) , q × Z ∗ q × Z ∗ q → Z k as well as ζ i , k : Z k { if y = uw , | u | = i ( x , u 0 k w ) ζ i , k ( x , y ) = ( x , y ) where Pref i ( x ) and Suff i ( x ) are, respectively, the i -prefix and i -suffix of x .

26 / 79 q The following diagram commutes: Tandem Duplication q T tan q q Lemma Main tool - φ k -transform domain Z � k Z � k − − − − → i , k    � φ k  � φ k ζ i , k q × Z ∗ q × Z ∗ − − − − → Z k Z k i.e., for every string x ∈ Z � k q , φ k ( T tan i , k ( x )) = ζ i , k ( φ k ( x )) .

27 / 79 Example Tandem Duplication where the inserted elements are underlined. T tan Main tool - φ k -transform domain Assume Σ = Z 4 . Starting with 02123 and letting i = 1 and k = 2 leads to 1 , 2 02123 − − − − → 0212123     � φ 2 � φ 2 ζ 1 , 2 (02 , 102) − − − − → (02 , 10002)

Coding for DNA Storage in Live Organisms Moshe Schwartz Electrical - PowerPoint PPT Presentation

Coding for DNA Storage in Live Organisms Moshe Schwartz Electrical & Computer Engineering Ben-Gurion University Israel Based on joint works with: (alphabetically) Jehoshua Bruck Caltech Ohad Elishco Ben-Gurion

DNA D DNA Double bl Helix DNA stands for: DNA stands for: U d Under a Deoxyribose

Table of Contents Why DNA Computing? The Structure of DNA DNA Computing Operations on DNA

Take out your DNA model DNA and the Human Genome DNA Model How was your How was your model

Table of Contents Why DNA Computing? The Structure of DNA DNA Computing Operations on

Coding for Optimized Writing Rate in DNA Storage Siddharth Jain, Farzad Farnoud, Moshe Schwartz,

Formal Modeling in Cognitive Science 1 Coding Theorems Lecture 28: Kraft Inequality; Source Coding

DNA Computing Information Processing with DNA Molecules Christian Jacob, 01/2002. Table of

Eastern Shores (GHOTES) DNA A Family Tree DNA Project Family Tree DNA Family Tree DNA or

Classification Species of Organisms 13 billion known species of organisms This represents

Error Correcting Codes for DNA based Data Storage Shubham Chandak Stanford University ISMB/ECCB

DNA IN OUR FOOD? EXTRACTION OF DNA FROM STRAWBERRIES (GETTING THE DNA OUT OF STRAWBERRIES) -OR

The Design of Autonomous DNA The Design of Autonomous DNA Nanomechanical Devices: Devices:

DNA evidence: two important features match between two DNA profiles frequency of the DNA profile in

DNA Nucleus Contains cells genetic info (DNA) controls cell functions DNA Structure

Self-Assembling DNA Self-Assembling DNA N. Jonoska Jonoska, N. C. , N. C. Seeman Seeman, DNA

Go Bananas! Introduction Tell you about DNA Show you how to extract DNA from a Banana

1000 Downloads of Genetically Improved DNA Analysis Software CREST Open Workshop on Genetic

Dave Strenski, Cray Inc. Cray User Group, Atlanta 5-5-09 Storaasli - MRSC - 29 M 07 3 FPGA

Principles and Applicaons of Modern Principles and Applicaons of Modern DNA Sequencing DNA

Design of Digital Logic by Genetic Regulatory Networks Ron Weiss Department of Electrical

Using evolutionary computing to optimise BarraCUDA UKMAC 2016 W. B. Langdon Computer Science,

Max. likelihood & Bayesian techniques are both likelihood-based. Weaknesses of likelihood for

The Contribution of Bioinformatics to Evolutionary Thought A demonstration of the abilities of

CSE 527 Computational Biology Lectures 13-14 Gene Prediction Some References (more on schedule

Coding for DNA Storage in Live Organisms Moshe Schwartz Electrical - PowerPoint PPT Presentation

Coding for DNA Storage in Live Organisms Moshe Schwartz Electrical & Computer Engineering Ben-Gurion University Israel Based on joint works with: (alphabetically) Jehoshua Bruck Caltech Ohad Elishco Ben-Gurion

DNA D DNA Double bl Helix DNA stands for: DNA stands for: U d Under a Deoxyribose

Table of Contents Why DNA Computing? The Structure of DNA DNA Computing Operations on DNA

Take out your DNA model DNA and the Human Genome DNA Model How was your How was your model

Table of Contents Why DNA Computing? The Structure of DNA DNA Computing Operations on

Coding for Optimized Writing Rate in DNA Storage Siddharth Jain, Farzad Farnoud, Moshe Schwartz,

Formal Modeling in Cognitive Science 1 Coding Theorems Lecture 28: Kraft Inequality; Source Coding

DNA Computing Information Processing with DNA Molecules Christian Jacob, 01/2002. Table of

Eastern Shores (GHOTES) DNA A Family Tree DNA Project Family Tree DNA Family Tree DNA or

Classification Species of Organisms 13 billion known species of organisms This represents

Error Correcting Codes for DNA based Data Storage Shubham Chandak Stanford University ISMB/ECCB

DNA IN OUR FOOD? EXTRACTION OF DNA FROM STRAWBERRIES (GETTING THE DNA OUT OF STRAWBERRIES) -OR

The Design of Autonomous DNA The Design of Autonomous DNA Nanomechanical Devices: Devices:

DNA evidence: two important features match between two DNA profiles frequency of the DNA profile in

DNA Nucleus Contains cells genetic info (DNA) controls cell functions DNA Structure

Self-Assembling DNA Self-Assembling DNA N. Jonoska Jonoska, N. C. , N. C. Seeman Seeman, DNA

Go Bananas! Introduction Tell you about DNA Show you how to extract DNA from a Banana

1000 Downloads of Genetically Improved DNA Analysis Software CREST Open Workshop on Genetic

Dave Strenski, Cray Inc. Cray User Group, Atlanta 5-5-09 Storaasli - MRSC - 29 M 07 3 FPGA

Principles and Applicaons of Modern Principles and Applicaons of Modern DNA Sequencing DNA

Design of Digital Logic by Genetic Regulatory Networks Ron Weiss Department of Electrical

Using evolutionary computing to optimise BarraCUDA UKMAC 2016 W. B. Langdon Computer Science,

Max. likelihood &amp; Bayesian techniques are both likelihood-based. Weaknesses of likelihood for

The Contribution of Bioinformatics to Evolutionary Thought A demonstration of the abilities of

CSE 527 Computational Biology Lectures 13-14 Gene Prediction Some References (more on schedule

Max. likelihood & Bayesian techniques are both likelihood-based. Weaknesses of likelihood for