Reconstruction of Strings from their Substrings Spectrum (Sagi Marcovich, Eitan Yaakobi)



slide-1
SLIDE 1

Reconstruction of Strings from their Substrings Spectrum

Sagi Marcovich, Eitan Yaakobi

Technion – Israel Institute of Technology

Full version: https://arxiv.org/abs/1912.11108

Background picture from https://www.hpcwire.com/2019/09/20/dna-data-storage-innovation-reduces-write-times-boosts-density

slide-2
SLIDE 2

DNA-based Storage


Introduction | The Reconstruction Problem | Lossy Multispectrum | Erroneous Multispectrum

slide-3
SLIDE 3

DNA Storage System


slide-4
SLIDE 4

DNA Shotgun Sequencing

  • Accurate reading of DNA strands is limited to short lengths.
  • The information of the string is provided by a list of its substrings of fixed length $L$.
  • This list is the $L$-multispectrum.
  • The substrings are assembled to reconstruct the strand.

slide-5
SLIDE 5

Multispectrum Reconstruction Previous Work

  • Several recent papers study this reconstruction problem.
  • They consider different reading setups and various error models.
  • Those include:

1. R. Arratia, D. Martin, G. Reinert, and M. Waterman, “Poisson process approximation for sequence repeats, and sequencing by hybridization,” Journal of Computational Biology: a Journal of Computational Molecular Cell Biology, vol. 3, pp. 425–463, 1996.
2. A. S. Motahari, G. Bresler, and D. Tse, “Information theory of DNA shotgun sequencing,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6273–6289, 2013.
3. A. S. Motahari, K. Ramchandran, D. Tse, and N. Ma, “Optimal DNA shotgun sequencing: Noisy reads are as good as noiseless reads,” in Proc. of the IEEE International Symposium on Information Theory, Istanbul, Turkey, 2013, pp. 1640–1644.
4. S. Ganguly, E. Mossel, and M. Racz, “Sequence assembly from corrupted shotgun reads,” in Proc. of the IEEE International Symposium on Information Theory, Barcelona, Spain, 2016, pp. 265–269.
5. G. Bresler, M. Bresler, and D. Tse, “Optimal assembly for high throughput shotgun sequencing,” BMC Bioinformatics, vol. 14, 2013.
6. I. Shomorony, T. Courtade, and D. Tse, “Do read errors matter for genome assembly?” in Proc. of the IEEE International Symposium on Information Theory, Hong Kong, 2015, pp. 919–923.
7. I. Shomorony, G. Kamath, F. Xia, T. Courtade, and D. Tse, “Partial DNA assembly: A rate-distortion perspective,” in Proc. of the IEEE International Symposium on Information Theory, Barcelona, Spain, 2016, pp. 1799–1803.
8. R. Gabrys and O. Milenkovic, “Unique reconstruction of coded sequences from multiset substring spectra,” in Proc. of the IEEE International Symposium on Information Theory, Vail, Colorado, USA, 2018, pp. 2540–2544.

slide-6
SLIDE 6

Basic Definitions

  • Frequently used: $w = (w_1, \ldots, w_n) \in \Sigma^n$.
  • Substring: $w_{i,k} = (w_i, \ldots, w_{i+k-1})$.
  • $k$-prefix: $w_{1,k}$; $k$-suffix: $w_{n-k+1,k}$.
  • The $L$-multispectrum of $w$ is the multiset $S_L(w) = \{w_{1,L}, w_{2,L}, \ldots, w_{n-L+1,L}\}$.
  • $w$ is called $L$-reconstructible if it can be uniquely reconstructed from $S_L(w)$.

$x$ = 0100000111011111

$S_8(x)$ (8-multispectrum): 01000001 10000011 00000111 00001110 00011101 00111011 01110111 11101111 11011111

Reconstruct: 0100000111011111 = $x$
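The definition of the $L$-multispectrum can be illustrated with a few lines of Python. This is a minimal sketch, not code from the presentation: the function name `multispectrum` and the use of `collections.Counter` as the multiset are assumptions made for illustration.

```python
from collections import Counter

def multispectrum(w: str, L: int) -> Counter:
    """Multiset of all length-L substrings of w (the L-multispectrum)."""
    return Counter(w[i:i + L] for i in range(len(w) - L + 1))

# The example string from the slide.
x = "0100000111011111"
S8 = multispectrum(x, 8)

# A string of length n has n - L + 1 substrings of length L.
print(len(x) - 8 + 1, sum(S8.values()))
```

A `Counter` keeps multiplicities, which matters because the multispectrum is a multiset: a repeated substring is counted as many times as it occurs.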

slide-7
SLIDE 7

Problem Definition

  • In many cases the $L$-multispectrum cannot be read error free.

$S_8(x)$: 01000001 10000011 00000111 00001110 00011101 00111011 01110111 11101111 11011111

Lossy Multispectrum: 01000001 00000111 00001110 00011101 01110111 11101111

Erroneous Multispectrum: 01010001 10000011 00000111 00001110 00011000 00111011 01110111 11001111 11011111

slide-8
SLIDE 8

Lossy Multispectrum

  • A multiset $U$ is a $t$-losses $L$-multispectrum of $w$ if $U \subseteq S_L(w)$ and $|S_L(w) \setminus U| \le t$.
  • $B_{L,t}(w)$ consists of all the $t$-losses $L$-multispectrums of $w$.
  • Maximal reconstructible substring $W_1(U)$:
  • Because of the losses, entries from the start or the end of $w$ can be absent.
  • $W_1(U)$ is the largest consecutive substring of $w$ contained in $U$, and $|W_1(U)| \ge n - t$.
  • $w$ is $(L, t)$-reconstructible if its maximal reconstructible substring $W_1(U)$ can be uniquely reconstructed from any $U \in B_{L,t}(w)$.

$x$ = 0100000111011111

$U$ (3-losses 8-multispectrum): 01000001 10000011 00000111 00001110 00011101 00111011 01110111 11101111 11011111

Reconstruct: 10000011101111 = $W_1(U)$
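The membership condition for a $t$-losses $L$-multispectrum can be checked directly with multiset arithmetic. The sketch below is illustrative only; the names `multispectrum` and `is_t_losses_multispectrum` are hypothetical, and the lossy example is the one shown on the Problem Definition slide.

```python
from collections import Counter

def multispectrum(w, L):
    """Multiset of all length-L substrings of w."""
    return Counter(w[i:i + L] for i in range(len(w) - L + 1))

def is_t_losses_multispectrum(U, w, L, t):
    """True if U is a sub-multiset of S_L(w) missing at most t substrings."""
    S = multispectrum(w, L)
    U = Counter(U)
    extra = U - S      # substrings in U that S_L(w) cannot account for
    missing = S - U    # substrings of S_L(w) that were lost
    return not extra and sum(missing.values()) <= t

x = "0100000111011111"
lossy = ["01000001", "00000111", "00001110",
         "00011101", "01110111", "11101111"]
print(is_t_losses_multispectrum(lossy, x, 8, 3))
```

`Counter` subtraction drops non-positive counts, so `extra` is empty exactly when $U \subseteq S_L(w)$ as multisets.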

slide-9
SLIDE 9

$L$-Reconstructible and $L$-Substring Unique

  • A string $w \in \Sigma^n$ is called $L$-substring unique if for every $1 \le i < j \le n - L + 1$, $w_{i,L} \ne w_{j,L}$.
  • Theorem [1]. For $L \ge a \log n$ where $a > 1$, the asymptotic rate of the set of $L$-substring unique strings approaches 1.
  • Theorem [2]. If $x$ is $(L-1)$-substring unique then it is $L$-reconstructible.
  • The $(L-1)$-prefix of the first $L$-substring appears once in $S_L(x)$.
  • Similarly for the $(L-1)$-suffix of the last $L$-substring.
  • Every other $(L-1)$-prefix or suffix appears twice.

[1] O. Elishco, R. Gabrys, M. Medard, and E. Yaakobi, “Repeat free codes,” in Proc. of the IEEE International Symposium on Information Theory, Paris, France, 2019, pp. 932–936.
[2] E. Ukkonen, “Approximate string-matching with q-grams and maximal matches,” Theoretical Computer Science, vol. 92, no. 1, pp. 191–211, 1992.
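Substring uniqueness is easy to test by brute force on short strings. The following sketch (the function name `is_substring_unique` is an assumption, not from the slides) checks the running example for several window lengths.

```python
def is_substring_unique(w: str, L: int) -> bool:
    """True if all length-L substrings of w are pairwise distinct."""
    subs = [w[i:i + L] for i in range(len(w) - L + 1)]
    return len(set(subs)) == len(subs)

x = "0100000111011111"
for L in (7, 6, 4):
    print(L, is_substring_unique(x, L))
```

For this $x$ the check succeeds for lengths 7 and 6 and fails for length 4, since 0000 occurs at two positions.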

slide-10
SLIDE 10

Basic Stitching

  • $x$ = 0100000111011111 is 7-substring unique
  • $S_8(x)$ = {01000001, 10000011, 00000111, 00001110, 00011101, 00111011, 01110111, 11101111, 11011111}
  • Step 1 – Find a 7-prefix that appears once

Remaining $S_8(x)$: 01000001 10000011 00000111 00001110 00011101 00111011 01110111 11101111 11011111

$x$ =

slide-11
SLIDE 11

Basic Stitching

  • Step 2 – Find a 7-prefix that matches the current 7-suffix

Remaining $S_8(x)$: 10000011 00000111 00001110 00011101 00111011 01110111 11101111 11011111

$x$ = 01000001

slide-12
SLIDE 12

Basic Stitching

  • Step 2 – Find a 7-prefix that matches the current 7-suffix

Remaining $S_8(x)$: 00000111 00001110 00011101 00111011 01110111 11101111 11011111

$x$ = 010000011

slide-13
SLIDE 13

Basic Stitching

  • Step 2 – Find a 7-prefix that matches the current 7-suffix

Remaining $S_8(x)$: 00001110 00011101 00111011 01110111 11101111 11011111

$x$ = 0100000111

slide-14
SLIDE 14

Basic Stitching

  • Step 2 – Find a 7-prefix that matches the current 7-suffix

Remaining $S_8(x)$: 00011101 00111011 01110111 11101111 11011111

$x$ = 01000001110

slide-15
SLIDE 15

Basic Stitching

  • Step 2 – Find a 7-prefix that matches the current 7-suffix

Remaining $S_8(x)$: 00111011 01110111 11101111 11011111

$x$ = 010000011101

slide-16
SLIDE 16

Basic Stitching

  • Step 2 – Find a 7-prefix that matches the current 7-suffix

Remaining $S_8(x)$: 01110111 11101111 11011111

$x$ = 0100000111011

slide-17
SLIDE 17

Basic Stitching

  • Step 2 – Find a 7-prefix that matches the current 7-suffix

Remaining $S_8(x)$: 11101111 11011111

$x$ = 01000001110111

slide-18
SLIDE 18

Basic Stitching

  • Step 2 – Find a 7-prefix that matches the current 7-suffix

Remaining $S_8(x)$: 11011111

$x$ = 010000011101111

slide-19
SLIDE 19

Basic Stitching

  • Done.

Remaining $S_8(x)$: (empty)

$x$ = 0100000111011111
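The two stitching steps walked through above can be sketched in Python as follows. This is an illustrative sketch under the slide's assumption that the string is $(L-1)$-substring unique; the function name `basic_stitch` and the dictionary-based lookup are choices made here, not taken from the presentation.

```python
from collections import Counter

def basic_stitch(reads, L):
    """Rebuild a string from its L-multispectrum.

    Assumes the original string is (L-1)-substring unique, so every
    (L-1)-prefix identifies exactly one read.
    """
    # Count every (L-1)-prefix and (L-1)-suffix over all reads.
    occ = Counter()
    for r in reads:
        occ[r[:L - 1]] += 1
        occ[r[1:]] += 1
    # Step 1: the first read is the one whose (L-1)-prefix appears once.
    start = next(r for r in reads if occ[r[:L - 1]] == 1)
    # Step 2: repeatedly find the read whose prefix matches the current suffix.
    by_prefix = {r[:L - 1]: r for r in reads}
    x = start
    for _ in range(len(reads) - 1):
        x += by_prefix[x[-(L - 1):]][-1]
    return x

# The 8-multispectrum of the slide example, deliberately given out of order.
reads = sorted(["01000001", "10000011", "00000111", "00001110", "00011101",
                "00111011", "01110111", "11101111", "11011111"])
print(basic_stitch(reads, 8))
```

Each iteration appends a single symbol, mirroring the slides where $x$ grows by one character per step.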

slide-20
SLIDE 20

Lossy Multispectrum


slide-21
SLIDE 21

More Advanced Stitching

  • $x$ = 0100000111011111 is 7-substring unique
  • $S_8(x)$ = {01000001, 10000011, 00000111, 00001110, 00011101, 00111011, 01110111, 11101111, 11011111}
  • Assume now that the substring at position 5 is lost

$U$: 01000001 10000011 00000111 00001110 00111011 01110111 11101111 11011111 (lost: 00011101)

$x$ =

slide-22
SLIDE 22

More Advanced Stitching

  • Assume now that the substring at position 5 is lost
  • We’re stuck! Attempt to continue.

Remaining $U$: 00111011 01110111 11101111 11011111

$x$ = 01000001110

slide-23
SLIDE 23

More Advanced Stitching

  • Assume now that the substring at position 5 is lost
  • Step 1 – Find a prefix that appears once

Remaining $U$: 00111011 01110111 11101111 11011111

$y_1$ = 01000001110

$y_2$ =

slide-24
SLIDE 24

More Advanced Stitching

  • Assume now that the substring at position 5 is lost
  • Step 2 – Find a prefix that matches the current suffix

Remaining $U$: 01110111 11101111 11011111

$y_1$ = 01000001110

$y_2$ = 00111011

slide-25
SLIDE 25

More Advanced Stitching

  • Assume now that the substring at position 5 is lost
  • Step 3 – ?
  • What happens if we reduce the overlapping window length?

Remaining $U$: (empty)

$y_1$ = 01000001110

$y_2$ = 00111011111

slide-26
SLIDE 26

More Advanced Stitching

  • Assume now that the substring at position 5 is lost
  • Step 3 – Reduce the overlapping window length
  • The unique matching of a suffix and a prefix of length $\ell$ is ensured only if $x$ is $\ell$-substring unique
  • Luckily for us, $x$ is also 6-substring unique.

$y_1$ = 01000001110

$y_2$ = 00111011111

$x$ = 0100000111011111

slide-27
SLIDE 27

Stitch($A$, $\rho$) Algorithm

The suffix-prefix matching window goes from $L - 1$ down to $L - \rho - 1$. At each iteration we reconstruct larger substrings. When the current substring can’t be stitched, it is put in the output set of the iteration.
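The shrinking-window idea can be sketched in Python as below. This is not the authors' Stitch($A$, $\rho$) procedure, whose details are in the full paper; it is a simplified greedy variant written for illustration. The function name `stitch`, the merge order, and the guard against a zero-length window are assumptions, and the sketch relies on the original string being substring unique for every window length it uses.

```python
def stitch(reads, L, rho):
    """Greedily merge fragments with overlap windows L-1, L-2, ..., L-rho-1.

    Returns the list of fragments that could not be merged any further.
    """
    frags = list(reads)
    for win in range(L - 1, L - rho - 2, -1):
        if win < 1:
            break  # an empty overlap window carries no information
        merged = True
        while merged:
            merged = False
            for i, a in enumerate(frags):
                # Look for another fragment whose prefix equals a's suffix.
                j = next((j for j, b in enumerate(frags)
                          if j != i and a[-win:] == b[:win]), None)
                if j is not None:
                    joined = a + frags[j][win:]
                    frags = [f for k, f in enumerate(frags) if k not in (i, j)]
                    frags.append(joined)
                    merged = True
                    break
    return frags

# Slide example: the substring at position 5 (00011101) is lost.
U = ["01000001", "10000011", "00000111", "00001110",
     "00111011", "01110111", "11101111", "11011111"]
print(sorted(stitch(U, 8, 0)))  # window 7 only: two fragments remain
print(stitch(U, 8, 1))          # windows 7 and 6: the gap is bridged
```

With $\rho = 0$ the sketch stops at the two fragments $y_1$ and $y_2$ from the slides; allowing one smaller window joins them through their shared 6-symbol overlap 001110.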

slide-28
SLIDE 28

Reconstruct Using Stitching only

  • Let $x$ be a string that is $(L - t - 1)$-substring unique, and $U \in B_{L,t}(x)$.
  • Suppose we apply Stitch($U$, $t$):
  • At iteration $k = 1, \ldots, t$, the algorithm bridges all gaps of $k$ losses!
  • Hence, the output is $W_1(U)$.
  • A similar method was proposed in [2] by Gabrys and Milenkovic.
  • Their goal was to overcome bursts of losses of fixed length.
  • However, in our case this is not optimal.

[2] R. Gabrys and O. Milenkovic, “Unique reconstruction of coded sequences from multiset substring spectra,” in Proc. of the IEEE International Symposium on Information Theory, Vail, Colorado, USA, 2018, pp. 2540–2544.

slide-29
SLIDE 29

LREC Reconstruction

  • After $t/3$ iterations:
  • There are at most 3 output substrings!
  • Assume otherwise, and observe the gaps between 4 substrings
  • One of $g_1, g_2, g_3$ is at most $t/3$, and was stitched at a previous iteration.
  • Thus, we require that $x$ is $(L - t/3 - 1)$-substring unique!

[Figure: four output substrings separated by gaps $g_1$, $g_2$, $g_3$]
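The counting step behind the "at most 3 output substrings" claim can be written out explicitly. This is a sketch that assumes $g_1, g_2, g_3$ denote the numbers of lost substrings in the three gaps between four consecutive output substrings, so that together they account for at most $t$ losses.

```latex
g_1 + g_2 + g_3 \le t
\quad\Longrightarrow\quad
\min\{g_1, g_2, g_3\} \le \frac{g_1 + g_2 + g_3}{3} \le \frac{t}{3}.
```

Under that assumption at least one gap has at most $t/3$ losses, so an iteration with a matching window of length $L - t/3 - 1$ would already have bridged it, which contradicts four substrings remaining.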

slide-30
SLIDE 30

LREC Reconstruction

  • After $t/3$ iterations:
  • If not successful, there are only two possible cases:
  • Two substrings left with a gap of at least $t/3$
  • Three substrings left with gaps of at least $t/3$
  • We apply constraints to the start and end of $x$ to eliminate those cases.

[Figure: two substrings $y_1$, $y_2$ separated by a gap $g$; three substrings $y_1$, $y_2$, $y_3$ separated by gaps $g_1$, $g_2$; window lengths $L - t - 1$ and $L - \lceil 2t/3 \rceil - 1$]

slide-31
SLIDE 31

LREC Constraints

1. $x$ is $\ell_1$-substring unique.
2. The first and last $t + 1$ length-$\ell_2$ substrings of $x$ are not identical to any other length-$\ell_2$ substring of $x$.
3. The first $t + 1$ length-$\ell_3$ substrings of $x$ are not identical to the last $t + 1$ length-$\ell_3$ substrings of $x$.

Constraint 1: $\ell_1 = L - t/3 - 1$
Constraint 2: $\ell_2 = L - 2t/3 - 1$
Constraint 3: $\ell_3 = L - t - 1$

  • We apply constraints 2 and 3 to the first and last $t + 1$ substrings, since the first or last $t$ substrings can be lost from the multispectrum.
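A brute-force check of the three constraints on a short string may help make them concrete. The sketch below is an assumption-laden illustration: the function names are hypothetical, the lengths $\ell_1, \ell_2, \ell_3$ are passed in explicitly so that no rounding rule for $t/3$ has to be guessed, and constraint 2 is read as "each of the first and last $t+1$ substrings differs from every other substring of that length".

```python
def substrings(x, length):
    """All substrings of x of the given length, in position order."""
    return [x[i:i + length] for i in range(len(x) - length + 1)]

def satisfies_lrec(x, t, l1, l2, l3):
    # Constraint 1: x is l1-substring unique.
    s1 = substrings(x, l1)
    if len(set(s1)) != len(s1):
        return False
    # Constraint 2: the first and last t+1 length-l2 substrings are each
    # different from every other length-l2 substring.
    s2 = substrings(x, l2)
    m = len(s2)
    edge = set(range(min(t + 1, m))) | set(range(max(0, m - t - 1), m))
    for i in edge:
        if any(s2[i] == s2[j] for j in range(m) if j != i):
            return False
    # Constraint 3: no first-(t+1) length-l3 substring equals a
    # last-(t+1) length-l3 substring at a different position.
    s3 = substrings(x, l3)
    m = len(s3)
    head = range(min(t + 1, m))
    tail = range(max(0, m - t - 1), m)
    return not any(s3[i] == s3[j] for i in head for j in tail if i != j)

x = "0100000111011111"
# With L = 8 and t = 3: l1 = 8 - 1 - 1, l2 = 8 - 2 - 1, l3 = 8 - 3 - 1.
print(satisfies_lrec(x, 3, 6, 5, 4))
```

For the running example with $L = 8$ and $t = 3$ all three checks pass, even though $x$ is not 4-substring unique: constraint 3 only compares the two ends of the string.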

slide-32
SLIDE 32

Cardinality Analysis LREC Constraints

  • πΈπ‘œ(𝑀, 𝑒) – the set of strings that satisfy π‘œ, 𝑀, 𝑒 -LREC Constraints.
  • Let 𝑒 = 𝑐 log(π‘œ) + 𝑝(log π‘œ ) for 0 ≀ 𝑐 ≀ 3.
  • Theorem. For 𝑀 β‰₯ 𝑏 log π‘œ

+ 𝑒/3 + 1 where 𝑏 > 1 + 𝑐/3, the asymptotic rate

  • f πΈπ‘œ(𝑀, 𝑒) approaches 1.

Introduction | The Reconstruction Problem | Lossy Multispectrum | Erroneous Multispectrum

32

slide-33
SLIDE 33

Cardinality Analysis LREC Constraints

  • The following compares the minimal $L$ for rate-1 reconstruction, where $a > 1$.

                                              LREC Reconstruction              Only Stitching
$t = o(\log n)$                               $L \ge a \log n + t/3 + 1$       $L \ge a \log n + t + 1$
$t = b \log(n) + o(\log n)$, $0 < b \le 3$    $L \ge a \log n + 2t/3 + 1$

slide-34
SLIDE 34

Erroneous Multispectrum


slide-35
SLIDE 35

Erroneous Multispectrum

  • A multiset $U = \{u_1, \ldots, u_{n-L+1}\}$ is a $(t, s)$-erroneous $L$-multispectrum of $w$ if:
  • $I_e(U) = \{i_1, \ldots, i_m\}$ is a set of indices, $m \le t$.
  • For every $i \in I_e(U)$, $d_H(u_i, w_{i,L}) \le s$ (erroneous substrings); otherwise $u_i = w_{i,L}$ (correct substrings).
  • $B_{L,t,s}(w)$ consists of all the $(t, s)$-erroneous $L$-multispectrums of $w$.
  • Maximal reconstructible substring $W_2(U)$:
  • The majority agreement for each entry of $w$ in $U$.
  • $w$ is $(L, t, s)$-reconstructible if $W_2(U)$ can be uniquely reconstructed from any $U \in B_{L,t,s}(w)$.

slide-36
SLIDE 36

$(L, t)$-Substring Distant

  • A string $w \in \Sigma^n$ is called $(L, t)$-substring distant if the Hamming distance of its $L$-multispectrum (the minimum distance between two of its substrings) is at least $t$, that is, $d_H(S_L(w)) \ge t$.
  • When $t = 1$, $w$ is $L$-substring unique.
  • Theorem. If $w$ is $(L - 1, 4s + 1)$-substring distant then it is $(L, t, s)$-reconstructible.
  • We generalize the stitching method to errors!
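The substring-distant property can be tested by brute force on small examples. The sketch below interprets the Hamming distance of the multispectrum as the minimum pairwise distance between substrings at different positions, which is consistent with the $t = 1$ case reducing to substring uniqueness; the function names are hypothetical.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def is_substring_distant(w: str, L: int, t: int) -> bool:
    """True if every two length-L substrings of w are at distance >= t."""
    subs = [w[i:i + L] for i in range(len(w) - L + 1)]
    return all(hamming(a, b) >= t for a, b in combinations(subs, 2))

x = "0100000111011111"
print(is_substring_distant(x, 7, 1))  # same as 7-substring unique
print(is_substring_distant(x, 7, 2))  # 0000011 and 0000111 differ in one place
```

The running example is $(7, 1)$-substring distant but not $(7, 2)$-substring distant, so it would not meet the $(L - 1, 4s + 1)$ condition of the theorem for any $s \ge 1$.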

slide-37
SLIDE 37

Reconstruction Algorithm

  • Receives $U \in B_{L,t,s}(x)$ for $x$ that is $(L - 1, 4s + 1)$-substring distant, outputs $W_2(U)$.
  • Use the substring-distant property of $x$ to identify the order of the substrings of $U$.
  • Take for each entry of $x$ the majority vote of its occurrences.

[Figure: the substrings $x_{1,L-1}, \ldots, x_{i,L-1}, \ldots, x_{j,L-1}, \ldots, x_{n-L+2,L-1}$ of $S_{L-1}(x)$ are at pairwise distance $\ge 4s + 1$; $\mathrm{Suff}_{L-1}(u_{i-1})$, $\mathrm{Pref}_{L-1}(u_i)$ and $\mathrm{Pref}_{L-1}(u_j)$ are each at distance $\le s$ from the corresponding substring; the matching pair is at distance $\le 2s$ and the non-matching pair at distance $\ge 2s + 1$]
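The final majority-vote step can be sketched in Python. This illustration covers only the voting and assumes the substrings have already been put in their correct order (the ordering step that uses the substring-distant property is not implemented here); the function name `majority_decode` is hypothetical, and the input is the erroneous multispectrum from the Problem Definition slide, taken in the order shown there.

```python
from collections import Counter

def majority_decode(ordered_reads, L):
    """Majority vote per entry, given reads ordered by starting position."""
    n = len(ordered_reads) + L - 1
    votes = [Counter() for _ in range(n)]
    for i, u in enumerate(ordered_reads):
        for k, symbol in enumerate(u):
            votes[i + k][symbol] += 1  # read i covers entries i .. i+L-1
    return "".join(v.most_common(1)[0][0] for v in votes)

# Erroneous 8-multispectrum of x = 0100000111011111 (three reads corrupted).
U = ["01010001", "10000011", "00000111", "00001110", "00011000",
     "00111011", "01110111", "11001111", "11011111"]
print(majority_decode(U, 8))
```

Entries near the ends of the string are covered by fewer reads, so their votes are less robust; in this example the corrupted symbols fall on entries that are covered by enough correct reads to be outvoted.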

slide-38
SLIDE 38

Cardinality Analysis of $(L, t)$-Substring Distant Strings

  • Let $S_n(L, t)$ denote the set of $(L, t)$-substring distant strings of length $n$.
  • Theorem. For fixed $t$, $a > 1$ and $L \ge \lceil a \log n \rceil$, the asymptotic rate of $S_n(L, t)$ approaches 1.
  • Theorem. For $L \ge 2 \log(n) + (t - 1) \log \log n + O(1)$ and $n$ large enough, it holds that $\mathrm{red}(S_n(L, t)) \le 1$.
  • Proofs are presented in the full version of this paper [4].

[4] S. Marcovich and E. Yaakobi, “Reconstruction of strings from their substrings spectrum,” arXiv: https://arxiv.org/abs/1912.11108, 2019.

slide-39
SLIDE 39

Encoding $(L, t)$-Substring Distant Strings

  • We developed encoding and decoding schemes that use a single redundancy bit for $L = 2 \log(n) + 2(t - 1) \log \log n + 4$.
  • This is only roughly $(t - 1) \log \log n$ away from the lower bound.
  • The encoding algorithm is based on two procedures: elimination and expansion.
  • Elimination – repeatedly look for substrings with Hamming distance of less than $t$, remove one of them and encode the occurrence.
  • Expansion – enlarge the string by adding small substrings while maintaining the $(L, t)$-substring distant property.
  • Full algorithm – in [4].

[4] S. Marcovich and E. Yaakobi, “Reconstruction of strings from their substrings spectrum,” arXiv: https://arxiv.org/abs/1912.11108, 2019.

slide-40
SLIDE 40

Summary

  • DNA Shotgun Sequencing – the string is assembled based on its substrings multispectrum.
  • Noisy Multispectrum – in some cases it is not possible to read the multispectrum error free.
  • Lossy Multispectrum – reconstruction using LREC constraints improves known solutions, with rate approaching 1 for $L \ge a \log n + t/3 + 1$ where $a > 1 + b/3$ and $t = b \log(n) + o(\log n)$.
  • Erroneous Multispectrum – reconstruction algorithm based on substring distant strings. A string is $(L, t, s)$-reconstructible if it is $(L - 1, 4s + 1)$-substring distant.

Thank You!