Coding for DNA Storage in Live Organisms - Moshe Schwartz - PowerPoint PPT Presentation



SLIDE 1

Coding for DNA Storage in Live Organisms

Moshe Schwartz

Electrical & Computer Engineering Ben-Gurion University Israel

SLIDE 2

Based on joint works with: (alphabetically)

  • Jehoshua Bruck – Caltech
  • Ohad Elishco – Ben-Gurion University (now MIT)
  • Farzad Farnoud (Hassanzadeh) – University of Virginia
  • Siddharth Jain – Caltech
  • Yonatan Yehezkeally – Ben-Gurion University

Introduction 2 / 79

SLIDE 3

Science fiction distant future dream?

Introduction 3 / 79

SLIDE 4

No – It’s just around the corner!

Introduction 4 / 79

SLIDE 5

DNA is a long string

Genetic information is stored in DNA, which is a string of nucleotides: Adenine, Cytosine, Guanine, and Thymine. In E. coli bacteria, genetic information is stored in about 4 · 10^6 base pairs. In humans, genetic information is stored in over 3 · 10^9 base pairs.

Introduction 5 / 79

SLIDE 6

Why store information in DNA?

DNA is dense!

It stores information at the molecular level. DNA can potentially hold 2.5 · 10^17 bytes (250 petabytes) of information in 1 gram of DNA. If we were to use 8 TB hard drives to store the same amount, we would need about 32,000 hard drives, with a total weight of about 25 tons!

Introduction 6 / 79
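As a sanity check on these numbers, here is a small back-of-the-envelope computation (my own illustration, not part of the talk; the decimal PB/TB units and the per-drive weight of about 0.78 kg are assumptions chosen to match the slide's totals):

```python
# Assumptions (mine, for illustration): 250 PB = 250 * 10**15 bytes,
# one hard drive holds 8 TB = 8 * 10**12 bytes, and weighs ~0.78 kg.
dna_bytes = 250 * 10**15       # capacity of 1 gram of DNA, per the slide
drive_bytes = 8 * 10**12       # one 8 TB hard drive
drives = dna_bytes / drive_bytes
print(round(drives))           # 31250, i.e. about 32,000 drives
print(round(drives * 0.78 / 1000, 1))  # total mass in metric tons, ~25
```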

SLIDE 7

OK, but why in living organisms?

  • Reading from DNA is destructive, hence we need several copies.

Living organisms replicate and solve this problem.

  • Data longevity is (potentially) better, due to replication of organisms.
  • The organism’s outer shell provides extra protection.
  • Labeling organisms for biological studies.
  • Watermarking genetically modified organisms (GMOs).

Main disadvantage:

Mutations!

We need error-correcting codes.

Introduction 7 / 79

SLIDE 8

Error-correcting codes – An age old story

An error-correcting code has two main components:

1. An error ball: its size and shape depend on the kind of errors the channel induces.

2. A packing of error balls: its density affects communication efficiency; its structure affects ease of encoding/decoding.

Introduction 8 / 79

SLIDE 9

What kinds of errors do we expect?

Insertion: uw → uvw
Duplication: uvw → uvvw
Substitution: uvw → uv′w
Deletion: uvw → uw

Which is the most common? Not yet known, but…

Introduction 9 / 79

SLIDE 10

Repeated sequences are everywhere

More than 50% of the human genome consists of repeated sequences!¹ Repetitions were shown to be connected with diseases such as cancer, myotonic dystrophy, and Huntington’s disease, and with important phenomena such as chromosome fragility, expansion diseases, gene silencing, and rapid morphological variation. Repetitions are common in other species as well, and are claimed to be a major evolutionary force during vertebrate evolution.¹

¹Lander et al., Nature 2001.  Introduction 10 / 79

SLIDE 11

Duplication processes may repeat

ACTCA ⇒ ACTACTCA ⇒ ACTATACTCA ⇒ ACTATACACTCA

It is conceivable that a substantial portion of the unique genome, the part that is not known to contain repeated sequences, also has its origins in ancient repeated sequences that are no longer recognizable due to change over time.²

²Lander et al., Nature 2001.  Introduction 11 / 79

SLIDE 12

Duplication processes may differ

Palindromic duplication: uvw → uv v^R w
Interspersed duplication: uvwz → uvwvz
End duplication: uvw → uvwv
Tandem duplication: uvw → uvvw

Introduction 12 / 79

SLIDE 13

A formal definition

Definition

Let Σ be a finite alphabet, s ∈ Σ* some string, and T a set of string-duplication rules, each a function T : Σ* → Σ*. A string-duplication system S, defined by the tuple (Σ, s, T), is the reflexive transitive closure of T operating on s, namely, S ⊆ Σ* is the minimal set for which:

1. s ∈ S.
2. s′ ∈ S and T ∈ T imply T(s′) ∈ S.

We write S = S(Σ, s, T).

Introduction 13 / 79

SLIDE 14

End duplication - formally

Definition (End Duplication)

T^end_{i,k}(x) = uvwv if x = uvw, |u| = i, |v| = k; and x otherwise.

T^end_k = { T^end_{i,k} : i ≥ 0 }.

The end-duplication system is defined as S^end_k = S(Σ, s, T^end_k).

uvw → uvwv

Introduction 14 / 79

SLIDE 15

Tandem duplication - formally

Definition (Tandem Duplication)

T^tan_{i,k}(x) = uvvw if x = uvw, |u| = i, |v| = k; and x otherwise.

T^tan_k = { T^tan_{i,k} : i ≥ 0 }.

The tandem-duplication system is defined as S^tan_k = S(Σ, s, T^tan_k).

uvw → uvvw

Introduction 15 / 79

SLIDE 16

How expressive is a duplication system?

Definition

The capacity of a string system S ⊆ Σ* is defined by

cap(S) = lim sup_{n→∞} log₂|S ∩ Σ^n| / n.

Definition

Let S ⊆ Σ* be a string system. We shall say S is fully expressive if for every v ∈ Σ* there exist u, w ∈ Σ* such that uvw ∈ S.

Introduction 16 / 79
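To make these definitions concrete, here is a small Python sketch (my own illustration, not part of the talk) that implements T^end_{i,k} and T^tan_{i,k} and enumerates the resulting systems up to a length cutoff. The exponential versus polynomial growth of |S ∩ Σ^n| previews the capacity results in the coming slides:

```python
def end_dup(x, i, k):
    # T^end_{i,k}: if x = uvw with |u| = i and |v| = k, append v at the end.
    return x + x[i:i + k] if i + k <= len(x) else x

def tandem_dup(x, i, k):
    # T^tan_{i,k}: if x = uvw with |u| = i and |v| = k, duplicate v in place.
    return x[:i + k] + x[i:i + k] + x[i + k:] if i + k <= len(x) else x

def system(seed, rule, k, max_len):
    # Reflexive transitive closure of the rules, truncated at max_len.
    seen, frontier = {seed}, [seed]
    while frontier:
        x = frontier.pop()
        for i in range(len(x) - k + 1):
            y = rule(x, i, k)
            if len(y) <= max_len and y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen

# Binary alphabet, seed "01", duplication length k = 1, strings up to length 10.
for rule in (end_dup, tandem_dup):
    words = system("01", rule, 1, 10)
    print(rule.__name__,
          [sum(1 for w in words if len(w) == n) for n in range(2, 11)])
# end_dup grows like 2^(n-2) (full capacity);
# tandem_dup yields only 0^a 1^b, i.e. n-1 strings per length (zero capacity).
```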

SLIDE 17

We are interested in:

  • How does the capacity depend on the choice of duplication rules?
  • How does the capacity depend on the choice of seed string?
  • Which systems are fully expressive?
  • What is the connection between capacity and full expressiveness?

Introduction 17 / 79

SLIDE 18

Some related previous work exists

Tandem duplication was studied in the context of formal languages:

  • Martín-Vide and Paun, Acta Cybernetica (1999): Where are tandem-duplication languages located in the Chomsky hierarchy?
  • Dassow, Mitrana and Paun, Bull. of the EATCS (1999): Binary tandem-duplication languages are regular.
  • Ming-Wei, Bull. of the EATCS (2000): Non-binary tandem-duplication languages are irregular.

Introduction 18 / 79

SLIDE 19

More related previous work exists

Tandem duplication was studied in an algorithmic context:

  • Main and Lorentz, J. Alg. (1984), Gusfield and Stoye, J. Comp. and Systems Sci. (2004): How to efficiently find tandem duplications in a string.
  • Matroud, Hendy, and Tuffley, Nucleic Acids Research (2011): How to efficiently find nested tandem duplications.
  • Elemento et al., Molecular Bio. and Evolution (2002), Lajoie et al., J. Comp. Biology (2007), Brejová et al., Phil. Trans. R. Soc. A (2014): How to reconstruct the derivation process of a tandem-duplicated string.

Introduction 19 / 79

SLIDE 20

End duplication has full capacity

Theorem

For S^end_k = S(Σ, s, T^end_k) with |s| ≥ k,

cap(S^end_k) = log₂|Σ|.

Assumption

The initial string s contains every symbol of Σ at least once.

End Duplication 20 / 79

SLIDE 21

End duplication has full capacity (Cont.)

Proof.

≤: We obviously have

cap(S^end_k) = lim sup_{n→∞} log₂|S^end_k ∩ Σ^n| / n ≤ lim sup_{n→∞} log₂|Σ^n| / n = log₂|Σ|.

End Duplication 21 / 79

SLIDE 22

End duplication has full capacity (Cont.)

Proof.

≥: We claim that starting with any string s, |s| ≥ k, in which each symbol appears at least once, and for any w = w₁w₂…w_k ∈ Σ^k, we can derive a string y with w as a suffix.

Step I: Duplicate the prefix. Assume s = uv with |u| = k; then s = uv ⇒ uvu = s′.
Observation: every symbol of Σ appears at the beginning and at the end of some k-substring of s′.

Step II: Force w₁ at the end.

[figure: end-duplicate a k-substring ending in w₁, so the string now ends with w₁]

End Duplication 22 / 79

SLIDE 23

End duplication has full capacity (Cont.)

Proof.

Step III: Force w₁w₂ at the end.

[figure: end-duplicate a k-substring ending in w₂, and then a k-substring ending in w₁w₂]

Repeat Step III inductively to get w₁w₂…w_k as a suffix.

End Duplication 23 / 79

SLIDE 24

End duplication has full capacity (Cont.)

Proof.

Step IV: Repeat the previous steps to get every k-word from Σ^k as a substring. Thus, after at most 2k|Σ|^k duplications we get a string s′′ containing all possible k-substrings, with |s′′| ≤ 2k²|Σ|^k. For any n = |s′′| + tk we can now create |Σ|^{tk} distinct strings. Hence,

cap(S^end_k) = lim sup_{n→∞} log₂|S^end_k ∩ Σ^n| / n ≥ lim sup_{t→∞} log₂(|Σ|^{tk}) / (|s′′| + tk) = log₂|Σ|.

Corollary

S^end_k systems are fully expressive.

End Duplication 24 / 79

SLIDE 25

Tandem duplication behaves differently

But first… Main tool – the φ_k-transform domain. We assume WLOG that Σ = Z_q.

Definition

We define the transform φ_k : Z_q^* → Z_q^k × Z_q^* (on strings of length at least k) by

φ_k(x) = (Pref_k(x), Suff_{|x|−k}(x) − Pref_{|x|−k}(x)),

as well as ζ_{i,k} : Z_q^k × Z_q^* → Z_q^k × Z_q^*,

ζ_{i,k}(x, y) = (x, u0^k w) if y = uw, |u| = i; and (x, y) otherwise,

where Pref_i(x) and Suff_i(x) are, respectively, the i-prefix and i-suffix of x, and subtraction is entry-wise modulo q.

Tandem Duplication 25 / 79

SLIDE 26

Main tool - φ_k-transform domain

Lemma

The following diagram commutes:

      x ──T^tan_{i,k}──→ T^tan_{i,k}(x)
      │ φ_k                   │ φ_k
      ↓                       ↓
   φ_k(x) ──ζ_{i,k}──→ ζ_{i,k}(φ_k(x))

i.e., for every string x ∈ Z_q^* of length at least k,

φ_k(T^tan_{i,k}(x)) = ζ_{i,k}(φ_k(x)).

Tandem Duplication 26 / 79

SLIDE 27

Main tool - φ_k-transform domain

Example

Assume Σ = Z₄. Starting with 02123 and letting i = 1 and k = 2 leads to

   02123 ──T^tan_{1,2}──→ 0212123
     │ φ₂                    │ φ₂
     ↓                       ↓
  (02, 102) ──ζ_{1,2}──→ (02, 10002)

where the inserted elements are the duplicated 21 in the string and the inserted 00 in the transform.

Tandem Duplication 27 / 79
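The transform and the commuting diagram are easy to check mechanically. A short Python sketch (my own illustration, not part of the talk) reproducing the slide's example:

```python
def phi(x, k, q):
    # phi_k(x) = (Pref_k(x), Suff_{|x|-k}(x) - Pref_{|x|-k}(x)), entry-wise mod q.
    return x[:k], [(x[j] - x[j - k]) % q for j in range(k, len(x))]

def zeta(xy, i, k):
    # zeta_{i,k}: insert 0^k into the second component after position i.
    x, y = xy
    return (x, y[:i] + [0] * k + y[i:]) if i <= len(y) else (x, y)

def tandem_dup(x, i, k):
    # T^tan_{i,k}: duplicate the length-k substring starting at position i.
    return x[:i + k] + x[i:i + k] + x[i + k:]

# The example from this slide: Sigma = Z4, x = 02123, i = 1, k = 2.
x = [0, 2, 1, 2, 3]
lhs = phi(tandem_dup(x, 1, 2), 2, 4)   # phi_2(T^tan_{1,2}(x))
rhs = zeta(phi(x, 2, 4), 1, 2)         # zeta_{1,2}(phi_2(x))
print(lhs)  # ([0, 2], [1, 0, 0, 0, 2]), i.e. (02, 10002)
print(rhs)  # the same: the diagram commutes
```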

SLIDE 28

Tandem duplication behaves differently

Theorem

For S^tan_k = S(Σ, s, T^tan_k) with |s| ≥ k, cap(S^tan_k) = 0.

Proof.

In the φ_k-transform domain, φ_k(s) = (x, y), and a tandem duplication becomes an insertion of 0^k in the y-part. Thus, a tandem-duplication operation is equivalent to throwing k balls into a bin, and there are at most |y| + 1 = |s| − k + 1 bins. Thus, after t tandem-duplication operations, there are at most

(|s| − k + t choose t) ≤ (|s| − k + t)^{|s|−k}

outcomes. Thus,

cap(S^tan_k) ≤ lim sup_{t→∞} log₂((|s| − k + t)^{|s|−k}) / (|s| + tk) = 0.

Tandem Duplication 28 / 79

SLIDE 29

Tandem duplication behaves differently

Corollary

S^tan_k systems are never fully expressive.

Proof.

If φ_k(s) = (x, y), then all possible mutations are limited (in the φ_k-transform domain) to (x, y′) with y′ being the same as y except for extra zeros. Thus, for instance, φ_k⁻¹(x, y·1), with a nonzero symbol appended to y, can never be obtained from s.

Tandem Duplication 29 / 79

SLIDE 30

Were we too strict?

Definition (Tandem Duplication)

T^tan_{i,k}(x) = uvvw if x = uvw, |u| = i, |v| = k; and x otherwise.

T^tan_{≥k} = { T^tan_{i,k′} : i ≥ 0, k′ ≥ k }.

The lower-bounded tandem-duplication system is defined as S^tan_{≥k} = S(Σ, s, T^tan_{≥k}).

uvw → uvvw

Tandem Duplication 30 / 79

SLIDE 31

Yes, we were! Here’s full expressiveness:

Theorem

S^tan_{≥k} is fully expressive.

Proof.

Employ a procedure similar to the one generating each substring in the proof for S^end_k, only each time copy a suffix of the string (from the chosen starting point to the end).

Tandem Duplication 31 / 79

SLIDE 32

What about full capacity?

Theorem

For any finite alphabet Σ and s ∈ Σ*, we have

cap(S^tan_{≥1}) ≥ log₂(r + 1),

where r is the largest (real) root of the polynomial f(x) = x^{|Σ|} − Σ_{i=0}^{|Σ|−2} x^i.

Proof Strategy: Find a set S ⊆ S^tan_{≥1} for which we can calculate the capacity. But how?

Tandem Duplication 32 / 79

SLIDE 33

Regular languages to the rescue

Definition (Recipe for a regular language)

  • A finite alphabet Σ.
  • A finite directed labeled graph G = (V, E, L), with E ⊆ V × V in the multiset sense, and L : E → Σ.
  • A starting state s ∈ V and a set of accepting states A ⊆ V.
  • If e₁e₂…e_n is a directed path in G, it generates the word L(e₁)L(e₂)…L(e_n).
  • The language represented by G, denoted S(G), is defined as the set of all words generated by directed paths starting at s and ending in A.

Tandem Duplication 33 / 79

SLIDE 34

A simple example of a regular language

Example

Consider a directed labeled graph G (figure omitted) for which S(G) is the set of all binary strings in which every 1 is followed by a 0.

Tandem Duplication 34 / 79

SLIDE 35

Graphs have properties

Definition

Let G = (V, E, L) be a graph generating a regular language.

  • G is irreducible if for every v₁, v₂ ∈ V, there is a directed path v₁ → v₂.
  • G is primitive if it is irreducible and the gcd of all cycle lengths is 1.
  • G is lossless if for every v₁, v₂ ∈ V and every word w ∈ Σ*, there is at most one path v₁ → v₂ that generates w.

Tandem Duplication 35 / 79

SLIDE 36

Counting paths is easy

Definition

For G = (V, E, L) define the adjacency matrix A_G = (a_{u,v}) as the |V| × |V| matrix where a_{u,v} is the number of edges from u to v in G.

Observation

  • The number of paths u → v of length n is exactly (A_G^n)_{u,v}.
  • For a lossless graph G with one accepting state, i.e., A = {v}, we have |S(G) ∩ Σ^n| = (A_G^n)_{s,v}.
  • Thus (with the above setting),

cap(S(G)) = lim sup_{n→∞} log₂((A_G^n)_{s,v}) / n.

Tandem Duplication 36 / 79

SLIDE 37

Enter Perron and Frobenius

O. Perron, G. Frobenius (Source: Wikipedia)

Theorem (Perron-Frobenius (Partial))

If G is a primitive graph then:

1. λ = λ(A_G) ≜ max{ |µ| : µ is an eigenvalue of A_G }, also called the spectral radius of G, is an eigenvalue of A_G.
2. There exist y, x > 0, unique (up to scalar multiplication) left and right eigenvectors for λ.
3. If y · xᵀ = 1, then lim_{n→∞} (1/λ^n) A_G^n = xᵀ · y.

Corollary

For a primitive lossless graph G, cap(S(G)) = log₂(λ(A_G)).

Tandem Duplication 37 / 79

SLIDE 38

Back to S^tan_{≥1}

Proof.

Main Idea: Find a regular language that “resides” within S^tan_{≥1} and use its capacity to lower bound cap(S^tan_{≥1}).

Phase I: Denote the alphabet letters a₁, a₂, …, a_{|Σ|}. As in the proof of full expressiveness, assume we reach a string with a_{|Σ|}…a₂a₁ as a suffix. From now on, we ignore everything except this suffix.

Phase II: Run in iterations. In iteration i, where i = |Σ|, |Σ|−1, …, 3, 2, use tandem duplication only on strings of the form a_i a_{i−1} … a₁. In the last iteration, tandem-duplicate single letters. It is easy to verify the resulting strings form the following regular language:

S = ( a⁺_{|Σ|} ( a⁺_{|Σ|−1} ( … ( a⁺₂ ( a⁺₁ )⁺ )⁺ … )⁺ )⁺ )⁺.

Tandem Duplication 38 / 79

SLIDE 39

Proof by sub-language (Cont.)

Proof.

S = ( a⁺_{|Σ|} ( a⁺_{|Σ|−1} ( … ( a⁺₂ ( a⁺₁ )⁺ )⁺ … )⁺ )⁺ )⁺.

[figure: the directed labeled graph generating S]

Tandem Duplication 39 / 79

SLIDE 40

Proof by sub-language (Cont.)

Proof.

The graph is lossless, irreducible, and primitive. Its adjacency matrix A_G is the |Σ| × |Σ| 0-1 matrix with 1’s on the main diagonal, 1’s on the subdiagonal, and 1’s across the entire first row. Thus, the number of paths of length n from the starting vertex to the accepting vertex grows exponentially as λ^n, where λ is the spectral radius of the graph, i.e., the largest root of

χ_{A_G}(λ) = det(λI − A_G) = (λ − 1)^{|Σ|} − Σ_{i=0}^{|Σ|−2} (λ − 1)^i.

Set x = λ − 1 and we obtain the result.

Tandem Duplication 40 / 79
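The lower bound can be checked numerically. The sketch below is my own illustration, not part of the talk; the adjacency matrix follows my reading of the proof's graph (ones on the diagonal, on the subdiagonal, and across the first row, which is consistent with the characteristic polynomial above). It compares the spectral radius with the root of f for |Σ| = 4:

```python
import math

m = 4  # |Sigma| = 4, e.g. the DNA alphabet

# Adjacency matrix (assumed structure, matching the characteristic polynomial):
# self-loop at every vertex, an edge descending one level, and edges from
# the a_1 vertex (index 0) restarting a block at any level.
A = [[0] * m for _ in range(m)]
for i in range(m):
    A[i][i] = 1
    if i >= 1:
        A[i][i - 1] = 1
A[0] = [1] * m

# Spectral radius via power iteration (A is primitive, so this converges).
v = [1.0] * m
for _ in range(500):
    w = [sum(A[r][c] * v[c] for c in range(m)) for r in range(m)]
    lam = max(w)
    v = [x / lam for x in w]

# Largest real root of f(x) = x^m - sum_{i=0}^{m-2} x^i, by bisection on [1, 2].
def f(x):
    return x**m - sum(x**i for i in range(m - 1))

lo, hi = 1.0, 2.0
for _ in range(80):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
r = lo

print(lam, r + 1)        # the two agree: lambda = r + 1
print(math.log2(r + 1))  # lower bound on cap(S^tan_{>=1}), about 1.30 bits/symbol
```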

SLIDE 41

What do we have so far?

Type | System | Capacity | Fully Expressive
End | S^end_k | Full | Yes
Tandem | S^tan_k | Zero | No
Tandem | S^tan_{≥k} | Partial or Full (open) | Yes

Open Question

Find cap(S^tan_{≥k}) or improve the bounds on it.

Tandem Duplication 41 / 79

SLIDE 42

Full capacity ⇔ full expressiveness?

Theorem

Let S be a string system over the alphabet Σ. If S has full capacity then S is fully expressive.

Proof.

Assume to the contrary that S never contains some w ∈ Σ^k as a substring. Partition every word x ∈ S into blocks of length k (and perhaps a remainder block of length at most k − 1). Each block has at most |Σ|^k − 1 choices, since w is forbidden. Thus,

|S ∩ Σ^n| ≤ (|Σ|^k − 1)^⌊n/k⌋ · |Σ|^{k−1}.

Then

cap(S) ≤ log₂(|Σ|^k − 1) / k < log₂|Σ|.

Tandem Duplication 42 / 79

SLIDE 43

What about the other direction?

Example

Consider the string system S = { vv | v ∈ Σ* }. It is obvious that S is fully expressive, but cap(S) = 1/2.

Open Question

This example is not a string-duplication system. What is the connection between full capacity and full expressiveness for string-duplication systems?

Tandem Duplication 43 / 79

SLIDE 44

A bit more on the big picture…

Type | System | Capacity | Fully Expressive
End | S^end_k | Full | Yes
Tandem | S^tan_k | Zero | No
Tandem | S^tan_{≥k} | Partial or Full (open) | Yes
Palindromic | S^pal_k | Partial or Full (open) | Yes
Interspersed | S^int_{k,k′} | Full | Open

Open Question

Complete the missing pieces in this table.

Tandem Duplication 44 / 79

SLIDE 45

Let’s add probability to the mix

Why?

  • Real biological processes are not always deterministic.
  • Just like Shannon vs. Hamming: it is interesting!

Case study:

  • Binary alphabet, Σ = {0, 1}, duplication length k = 1.
  • The position to duplicate is chosen independently and uniformly.
  • Two options:
    • S^tan_1 – tandem duplication: bit b becomes bb.
    • S̄^tan_1 – complement tandem duplication: bit b becomes bb̄ (b followed by its complement).

Pólya String Models 45 / 79

SLIDE 46

Is this a Pólya urn model?

An urn contains B black balls and W white balls. At each step, a ball is extracted uniformly and independently from the urn. The ball is returned to the urn, together with another ball of the same color. The process repeats.

Crucial difference:

There is no string structure in a Pólya urn model.

Pólya String Models 46 / 79

SLIDE 47

How would we define capacity?

Let S(i) denote the random variable whose value is the string after i mutations, and S(0) = s the seed string.

Definition

The probabilistic capacity of the process S is defined as

cap_Prob(S) = lim sup_{n→∞} (1/n) H(S(n)),

where H(S(n)) is the entropy of S(n), i.e.,

H(S(n)) = − Σ_{w∈Σ*} Pr(S(n) = w) log₂ Pr(S(n) = w).

The combinatorial capacity will be denoted by cap_Comb.

Pólya String Models 47 / 79

SLIDE 48

Not everything is uniformly distributed

Assume S̄^tan_1 with S(0) = 0 (mutation histories in parentheses):

n = 0: 0 (ε)
n = 1: 01 (1)
n = 2: 011 (21), 010 (12)
n = 3: 0111 (321), 0101 (231), 0110 (213); 0110 (312), 0100 (132), 0101 (123)

Thus, Pr(S(3) = 0110) = 1/3 but Pr(S(3) = 0111) = 1/6.

Pólya String Models 48 / 79
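The probabilities on this slide can be reproduced by brute force. A short Python sketch (my own illustration, not part of the talk) enumerates all equally likely histories:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def mutate(s, p):
    # Complement tandem duplication of length 1: bit b at position p becomes b, 1-b.
    return s[:p + 1] + (1 - s[p],) + s[p + 1:]

def distribution(n):
    # At step t the string has length t, so there are t equally likely positions;
    # enumerating itertools.product over these choices lists every history.
    counts = Counter()
    for choices in product(*[range(t) for t in range(1, n + 1)]):
        s = (0,)  # seed S(0) = 0
        for p in choices:
            s = mutate(s, p)
        counts[s] += 1
    total = sum(counts.values())
    return {s: Fraction(c, total) for s, c in counts.items()}

d = distribution(3)
print(d[(0, 1, 1, 0)])  # 1/3
print(d[(0, 1, 1, 1)])  # 1/6
```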

SLIDE 49

One simple connection exists

Lemma

For S ∈ { S^tan_1, S̄^tan_1 }, cap_Prob(S) ≤ cap_Comb(S).

Proof.

H(S(n)) is maximized when S(n) is uniformly distributed, so

H(S(n)) ≤ log₂|S ∩ Σ^{|S(0)|+n}|.

Thus,

cap_Prob(S) = lim sup_{n→∞} (1/n) H(S(n)) ≤ lim sup_{n→∞} (1/n) log₂|S ∩ Σ^{|S(0)|+n}| = cap_Comb(S).

Pólya String Models 49 / 79

SLIDE 50

So for tandem duplication…

Corollary

For any S(0) we have cap_Prob(S^tan_1) = 0.

Proof.

We obviously have cap_Prob(S^tan_1) ≥ 0. Additionally,

cap_Prob(S^tan_1) ≤ cap_Comb(S^tan_1) = 0,

which we already proved.

Pólya String Models 50 / 79

SLIDE 51

Complement-tandem duplication is harder

Assume S(0) = 0 for simplicity. Let us record the history of mutations in a string whose ith position equals j if the jth mutation caused the ith symbol.

Example

0 → 01 → 010 → 0110, with history ε → 1 → 12 → 312.

Observation

1. A history is a permutation.
2. Each permutation is equally likely.

Pólya String Models 51 / 79

SLIDE 52

Here it is again

Assume S̄^tan_1 with S(0) = 0 (mutation histories in parentheses):

n = 0: 0 (ε)
n = 1: 01 (1)
n = 2: 011 (21), 010 (12)
n = 3: 0111 (321), 0101 (231), 0110 (213); 0110 (312), 0100 (132), 0101 (123)

Some histories result in the same mutated string.

Pólya String Models 52 / 79

SLIDE 53

It’s all in the signature

Definition

The signature of a permutation π ∈ S_n is a binary string w = w₁w₂…w_{n−1}, where

w_i = 0 if π(i) > π(i + 1), and w_i = 1 if π(i) < π(i + 1).

Theorem

Consider S̄^tan_1 with S(0) = 0. Then Pr(S(n) = 01w) is the same as the probability of getting the signature w when choosing a permutation uniformly from S_n.

Pólya String Models 53 / 79
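The theorem can be verified by brute force for small n. The sketch below (my own illustration, not part of the talk) compares the distribution of the mutated string with the signature distribution of a uniform permutation:

```python
from collections import Counter
from itertools import permutations, product

def mutate(s, p):
    # bit b at position p becomes b, followed by its complement
    return s[:p + 1] + (1 - s[p],) + s[p + 1:]

def string_counts(n):
    # Counts of S(n) over all equally likely mutation histories.
    counts = Counter()
    for choices in product(*[range(t) for t in range(1, n + 1)]):
        s = (0,)
        for p in choices:
            s = mutate(s, p)
        counts[s] += 1
    return counts

def signature_counts(n):
    # Counts of signatures w of permutations of S_n:
    # w_i = 0 if pi(i) > pi(i+1), and w_i = 1 if pi(i) < pi(i+1).
    counts = Counter()
    for pi in permutations(range(1, n + 1)):
        w = tuple(1 if pi[i] < pi[i + 1] else 0 for i in range(n - 1))
        counts[(0, 1) + w] += 1  # compare against the mutated string 01w
    return counts

for n in range(2, 6):
    assert string_counts(n) == signature_counts(n)
print("Pr(S(n) = 01w) matches the signature distribution for n = 2..5")
```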

SLIDE 54

It’s all in the signature – Proof

Proof.

Assuming w ∈ {0, 1}^{n−1}, some notation first:

1. Π_{01w} – the set of history permutations that lead to the mutated string 01w.
2. Ψ_w – the set of permutations from S_n with signature w.
3. For any string v ∈ {0, 1}^ℓ, the set of positions where a 0 is preceded by a 1 (including possible edges):

T_v = { i ∈ [ℓ + 1] : (v_{i−1} = 1 or i = 1) and (v_i = 0 or i = ℓ + 1) }.

Example: for v = 0011010 we have T_v = {1, 5, 7}.

Pólya String Models 54 / 79

SLIDE 55

It’s all in the signature – Proof (Cont.)

Proof.

Strategy: Prove |Π_{01w}| = |Ψ_w| by showing both expressions satisfy the same recursion with the same starting conditions.

Starting conditions: Trivially |Π_{01ε}| = |Ψ_ε| = 1.

Recursion for Ψ_w: Given w ∈ {0, 1}^{n−1}, we can recursively construct a permutation π ∈ S_n with signature w by picking π⁻¹(n), which can only be some i ∈ T_w. We then recursively construct two permutations, with signatures w_{1…i−2} and w_{i+1…n−1}. Thus,

|Ψ_w| = Σ_{i∈T_w} (n−1 choose i−1) |Ψ_{w_{1…i−2}}| · |Ψ_{w_{i+1…n−1}}|.

Pólya String Models 55 / 79

SLIDE 56

It’s all in the signature – Proof (Cont.)

Proof.

Recursion for Π_{01w}: Given w ∈ {0, 1}^{n−1}, consider a history permutation π ∈ S_n resulting in the mutated sequence 01w. Obviously π⁻¹(1) is a position of a bit 1 in 01w which is last in a run, i.e., followed by a 0 or last in the string. Thus, pick π⁻¹(1), and construct the rest of the permutation recursively using w_{1…i−2} and w_{i+1…n−1}. Thus,

|Π_{01w}| = Σ_{i∈T_w} (n−1 choose i−1) |Π_{01 w_{1…i−2}}| · |Π_{10 w_{i+1…n−1}}|.

Pólya String Models 56 / 79

SLIDE 57

Last time I’m showing this slide

Assume S̄^tan_1 with S(0) = 0 (mutation histories in parentheses):

n = 0: 0 (ε)
n = 1: 01 (1)
n = 2: 011 (21), 010 (12)
n = 3: 0111 (321), 0101 (231), 0110 (213); 0110 (312), 0100 (132), 0101 (123)

Open Question

Find a nice bijection between Π_{01w} and Ψ_w.

Pólya String Models 57 / 79

SLIDE 58

And now, the capacity

Theorem

For S̄^tan_1 with S(0) = 0,

0.7213 ≈ log₂(e)/2 ≤ cap_Prob(S̄^tan_1) ≤ H₂(1/3) ≈ 0.9183,

where H₂(x) ≜ −x log₂(x) − (1 − x) log₂(1 − x) is the binary entropy function.

Pólya String Models 58 / 79

SLIDE 59

Proof of the bounds

Proof.

Consider real random variables X₁, X₂, …, chosen i.i.d. uniformly from [0, 1]. Sorting X₁, X₂, …, X_n generates a random permutation (by symmetry, uniform over S_n). Define

Q_i ≜ 1 if X_i < X_{i+1}, and 0 if X_i > X_{i+1}

(except for a 0-measure undefined set). So Q₁^{n−1} = Q₁Q₂…Q_{n−1} is the signature of a uniformly chosen random permutation from S_n.

Pólya String Models 59 / 79

SLIDE 60

Proof of the bounds (Cont.)

Proof.

We now have Pr(S(n) = 01w) = Pr(Q₁^{n−1} = w), and

cap_Prob(S̄^tan_1) = lim sup_{n→∞} (1/n) H(S(n)) = lim sup_{n→∞} (1/n) H(Q₁^{n−1}) = lim sup_{n→∞} (1/n) Σ_{i=1}^{n−1} H(Q_i | Q₁^{i−1}).

Pólya String Models 60 / 79

SLIDE 61

Proof of the bounds (Cont.)

Proof.

Lower bound: Since Q₁^{i−1} → X_i → Q_i is a Markov chain, we have H(Q_i | Q₁^{i−1}) ≥ H(Q_i | X_i). Furthermore, Pr(Q_i = 0 | X_i = x) = x. Thus,

cap_Prob(S̄^tan_1) = lim sup_{n→∞} (1/n) Σ_{i=1}^{n−1} H(Q_i | Q₁^{i−1}) ≥ lim sup_{n→∞} (1/n) Σ_{i=1}^{n−1} H(Q_i | X_i) = H(Q₁ | X₁) = ∫₀¹ H₂(x) dx = log₂(e)/2.

Pólya String Models 61 / 79
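Both constants are easy to check numerically. A small sketch (my own illustration, not part of the talk) evaluates the integral and the two bounds:

```python
import math

def h2(x):
    # binary entropy function H2(x)
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

# Numerically integrate H2 over [0, 1] with the midpoint rule.
N = 100000
integral = sum(h2((i + 0.5) / N) for i in range(N)) / N

print(integral)                # about 0.72135
print(math.log2(math.e) / 2)   # = 1/(2 ln 2), the lower bound, about 0.72135
print(h2(1 / 3))               # the upper bound, about 0.91830
```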

SLIDE 62

Proof of the bounds (Cont.)

Proof.

Upper bound:

cap_Prob(S̄^tan_1) = lim sup_{n→∞} (1/n) Σ_{i=1}^{n−1} H(Q_i | Q₁^{i−1}) ≤ lim sup_{n→∞} (1/n) Σ_{i=2}^{n−1} H(Q_i | Q_{i−1}) = H(Q₂ | Q₁) = (1/2)(H(Q₂ | Q₁ = 0) + H(Q₂ | Q₁ = 1)) = H₂(1/3),

since

Pr(Q₂ = 0 | Q₁ = 0) = (∫₀¹ dx₁ ∫₀^{x₁} dx₂ ∫₀^{x₂} dx₃) / (∫₀¹ dx₁ ∫₀^{x₁} dx₂) = (1/6)/(1/2) = 1/3,

and similarly for Pr(Q₂ = 1 | Q₁ = 1).

Pólya String Models 62 / 79

SLIDE 63

Probabilistic ≠ Combinatorial

Observation

cap_Prob(S̄^tan_1) ≤ H₂(1/3) < 1 = cap_Comb(S̄^tan_1).

Open Questions

1. Find cap_Prob(S̄^tan_1).
2. We know nothing for duplication length 2.

Pólya String Models 63 / 79

SLIDE 64

Moving on to error correction

An error-correcting code has two main components:

1. An error ball: its size and shape depend on the kind of errors the channel induces.

2. A packing of error balls: its density affects communication efficiency; its structure affects ease of encoding/decoding.

Error-Correcting Codes 64 / 79

SLIDE 65

Let us recall the scenario

  • Information is stored in the DNA of some bacteria.
  • The bacteria mutate over time.
  • When the information is read, the DNA has gone through a (perhaps unbounded) number of duplications.

Goal

Protect information against duplication errors!

Case study

We focus on S^tan_k – tandem duplication with fixed duplication length k.

Error-Correcting Codes 65 / 79

SLIDE 66

Some definitions are required

Definition

If v ∈ S(Σ, u, T^tan_k), we denote this as u ⇒*_k v. We say u is an ancestor of v, and v is a descendant of u. We define the descendant cone of u as

D*_k(u) = { v ∈ Σ* : u ⇒*_k v },

and the ancestor cone as

A*_k(u) = { v ∈ Σ* : v ⇒*_k u }.

[figure: the ancestor cone A*_k(u) and descendant cone D*_k(u) of u along the time axis]

Error-Correcting Codes 66 / 79

SLIDE 67

Now we define a code

Definition

An (n, M; ∗)_k code C is a subset C ⊆ Σ^n of size |C| = M, such that for all u, v ∈ C, u ≠ v,

D*_k(u) ∩ D*_k(v) = ∅.

The decoding problem

Given an (n, M; ∗)_k code C and a (mutated) word v ∈ Σ*, find

Decode(v) = A*_k(v) ∩ C.

Error-Correcting Codes 67 / 79

SLIDE 68

Reminder – The φ_k-transform

We assume WLOG that Σ = Z_q.

Definition

We define the transform φ_k : Z_q^* → Z_q^k × Z_q^* (on strings of length at least k) by

φ_k(x) = (Pref_k(x), Suff_{|x|−k}(x) − Pref_{|x|−k}(x)),

as well as ζ_{i,k} : Z_q^k × Z_q^* → Z_q^k × Z_q^*,

ζ_{i,k}(x, y) = (x, u0^k w) if y = uw, |u| = i; and (x, y) otherwise,

where Pref_i(x) and Suff_i(x) are, respectively, the i-prefix and i-suffix of x.

Error-Correcting Codes 68 / 79

SLIDE 69

Main tool - φ_k-transform domain

Lemma

The following diagram commutes:

      x ──T^tan_{i,k}──→ T^tan_{i,k}(x)
      │ φ_k                   │ φ_k
      ↓                       ↓
   φ_k(x) ──ζ_{i,k}──→ ζ_{i,k}(φ_k(x))

i.e., for every string x ∈ Z_q^* of length at least k,

φ_k(T^tan_{i,k}(x)) = ζ_{i,k}(φ_k(x)).

Error-Correcting Codes 69 / 79

SLIDE 70

Main tool - φ_k-transform domain

Example

Assume Σ = Z₄. Starting with 02123 and letting i = 1 and k = 2 leads to

   02123 ──T^tan_{1,2}──→ 0212123
     │ φ₂                    │ φ₂
     ↓                       ↓
  (02, 102) ──ζ_{1,2}──→ (02, 10002)

where the inserted elements are the duplicated 21 in the string and the inserted 00 in the transform.

Error-Correcting Codes 70 / 79

SLIDE 71

The ancestors are the key component

Definition

If A*_k(v) = {v} we say v is irreducible. The set of irreducible words is denoted Irr_k. The roots of v ∈ Σ* are defined by

R_k(v) = A*_k(v) ∩ Irr_k.

Lemma

For tandem duplication of length k, and every v ∈ Σ*, |R_k(v)| = 1.

Already proved by Leupold et al. (2005). We give a different proof, using φ_k, enabling a code construction.

Error-Correcting Codes 71 / 79

SLIDE 72

Proof of root uniqueness

Proof.

Denote φ_k(v) = (x, y), and y = 0^{m₀} y₁ 0^{m₁} y₂ 0^{m₂} … 0^{m_{t−1}} y_t 0^{m_t}, where y_i ≠ 0 for all i. Any ancestor v′ ∈ A*_k(v) must be of the form

φ_k(v′) = (x, 0^{m₀−i₀k} y₁ 0^{m₁−i₁k} y₂ 0^{m₂−i₂k} … 0^{m_{t−1}−i_{t−1}k} y_t 0^{m_t−i_tk}),

and it is irreducible if and only if

φ_k(v′) = (x, 0^{m₀ mod k} y₁ 0^{m₁ mod k} y₂ 0^{m₂ mod k} … 0^{m_{t−1} mod k} y_t 0^{m_t mod k}),

giving a unique root.

Error-Correcting Codes 72 / 79

SLIDE 73

Disjoint descendant cones are simple

Corollary

D*_k(u) ∩ D*_k(v) = ∅ if and only if R_k(u) ≠ R_k(v).

Proof.

⇒: If w ∈ D*_k(u) ∩ D*_k(v) then

R_k(u) ⇒*_k u ⇒*_k w and R_k(v) ⇒*_k v ⇒*_k w,

and since the root of w is unique, R_k(u) = R_k(v).

Error-Correcting Codes 73 / 79

SLIDE 74

Disjoint cones proof (Cont.)

Proof.

⇐: If R_k(u) = R_k(v) then denote

φ_k(R_k(u)) = φ_k(R_k(v)) = (x, 0^{m₀} y₁ 0^{m₁} y₂ 0^{m₂} … 0^{m_{t−1}} y_t 0^{m_t}).

Then,

φ_k(u) = (x, 0^{m′₀} y₁ 0^{m′₁} y₂ 0^{m′₂} … 0^{m′_{t−1}} y_t 0^{m′_t}),
φ_k(v) = (x, 0^{m″₀} y₁ 0^{m″₁} y₂ 0^{m″₂} … 0^{m″_{t−1}} y_t 0^{m″_t}).

Define w ∈ Σ* such that

φ_k(w) = (x, 0^{max(m′₀,m″₀)} y₁ 0^{max(m′₁,m″₁)} … 0^{max(m′_{t−1},m″_{t−1})} y_t 0^{max(m′_t,m″_t)}),

which immediately shows u ⇒*_k w and v ⇒*_k w.

Error-Correcting Codes 74 / 79

SLIDE 75

Putting it all together

Theorem

  • v ∈ Irr_k iff φ_k(v) = (x, y) and y is (0, k − 1)-RLL (every run of zeros has length at most k − 1).
  • Irr_k ∩ Σ^n is an (n, M; ∗)_k code.
  • Decoding v ∈ Σ* may be done in linear time by:
    1. Finding φ_k(v) = (x, y).
    2. Reducing runs of 0’s in y modulo k to obtain y′.
    3. Returning the answer φ_k⁻¹(x, y′).

Observation

The code may be further enlarged (and made optimal!) by carefully adding shorter RLL sequences.

Error-Correcting Codes 75 / 79
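The decoder on this slide is short enough to implement directly. The sketch below is my own illustration of the three steps above (not code from the talk); it recovers the root of the running example with Σ = Z₄ and k = 2:

```python
def phi(x, k, q):
    # phi_k(x) = (Pref_k(x), Suff_{|x|-k}(x) - Pref_{|x|-k}(x)) mod q
    return x[:k], [(x[j] - x[j - k]) % q for j in range(k, len(x))]

def phi_inv(head, y, q):
    # Inverse transform: x_j = y_{j-k} + x_{j-k} mod q
    x = list(head)
    for d in y:
        x.append((d + x[len(x) - len(head)]) % q)
    return x

def decode(v, k, q):
    # Root of v under k-tandem duplication:
    # reduce every run of zeros in the y-part modulo k.
    head, y = phi(v, k, q)
    y2, run = [], 0
    for d in y + [None]:          # sentinel flushes the final run
        if d == 0:
            run += 1
        else:
            y2 += [0] * (run % k)
            run = 0
            if d is not None:
                y2.append(d)
    return phi_inv(head, y2, q)

def tandem_dup(x, i, k):
    return x[:i + k] + x[i:i + k] + x[i + k:]

# Round trip: tandem-duplicate a codeword a few times, then recover it.
root = [0, 2, 1, 2, 3]            # irreducible: phi_2 gives y = 102, which is (0,1)-RLL
v = tandem_dup(tandem_dup(tandem_dup(root, 1, 2), 0, 2), 3, 2)
print(decode(v, 2, 4))  # [0, 2, 1, 2, 3]
```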

SLIDE 76

Other results

  • Tandem duplication with duplication lengths up to 3: forms a regular language, has a unique root, has positive (though not full) capacity, and is not fully expressive.³
  • A unique root exists in several other cases, enabling code construction and decoding.³

Theorem

Let Σ ≠ ∅ be an alphabet, and U ⊆ N, U ≠ ∅, a set of tandem-duplication lengths. Denote k = min(U). Then (Σ, U) is a unique-root pair if and only if it matches one of the following cases:

  • |Σ| = 1: U ⊆ kN
  • |Σ| = 2: U = {k}, or U ⊇ {1, 2}
  • |Σ| ≥ 3: U = {k}, U = {1, 2}, or U = {1, 2, 3}

³Jain et al., IEEE Trans. on Inform. Th. 2017.  Conclusion 76 / 79

SLIDE 77

Other results

  • What is the longest duplication distance to the root (in unbounded tandem duplication)? Apparently, for length-n sequences it is Θ(n) in the worst (and common!) case.⁴
  • In the probabilistic models we also know the capacity of end duplication, as well as of a mix of duplication and complement duplication – but only for duplication length k = 1.⁵
  • Tandem duplication with point mutation (substitution) has more capacity and expressiveness, but requires more care when constructing error-correcting codes.⁶

⁴Alon et al., ISIT 2016.  ⁵Elishco et al., ISIT 2016.  ⁶Jain et al., ISIT 2017.  Conclusion 77 / 79

SLIDE 78

Many open questions remain!

Open Questions

  • Study error-correcting codes for duplication models other than tandem duplication.
  • Find error-correcting codes for a probabilistic channel, correcting typical errors.
  • Study a mix of duplication and other mutations (substitutions, insertions/deletions).
  • Study error models which are context sensitive.
  • For the biologists: find out the channel parameters in the real world.

Conclusion 78 / 79

SLIDE 79

Thank You

Conclusion 79 / 79