Multiomics Data and String Reconstruction Substrings Subsequences Substring Composition THANK YOU!
String Reconstruction Problems in Multiomics Data Analysis Olgica - - PowerPoint PPT Presentation
String Reconstruction Problems in Multiomics Data Analysis
Olgica Milenkovic
Joint work with Ryan Gabrys
University of Illinois,
Multiomics Data and String Reconstruction
Central Dogma of Genetics
Multiomics Data
Multiomics Data Acquisition: DNA Sequencing
Sequencing Methods: Shotgun and nanopore (MinION) sequencing.
- 1. Shotgun sequencing - replicas of long string broken into overlapping
fragments, then assembled;
- 2. Nanopore sequencing - long strings read directly;
Multiomics Data Acquisition: Protein Sequencing
Sequencing Methods: Mass spectrometry and nanopore sequencing (less common).
- 1. Mass spectrometry - replicas of long strings broken into (overlapping)
fragments called peptides;
- 2. Peptides broken into prefixes and suffixes referred to as ion series.
Multiomics Data: Problems in String Reconstruction
Focus on binary sequences. Fundamental questions:
- 1. DNA Shotgun Sequencing and String Reconstruction:
1.1 When is a string x ∈ {0, 1}^n uniquely reconstructable from the (multi)set S_L(x) of its substrings of length L, known as the L-profile (type) of x?
1.2 Example 1: For x = 10010 and y = 00100, we have S_3(x) = S_3(y) = {100, 001, 010}, but S_4(x) = {1001, 0010} ≠ S_4(y) = {0010, 0100}.
- 2. DNA Nanopore Sequencing and String Reconstruction:
2.1 When is a string x ∈ {0, 1}^n uniquely reconstructable from the multiset K_k(x) of all its subsequences of length k?
2.2 Example 1: For x = 1001 and y = 0110, we have K_2(x) = K_2(y) = {10, 10, 11, 00, 01, 01}, but {100, 101, 101, 001} = K_3(x) ≠ K_3(y) = {010, 010, 011, 110}.
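The two evidence sets above can be computed by brute force. The following is a minimal Python sketch written only to reproduce the two examples; the function names `substring_profile` and `subsequence_multiset` are illustrative choices, not notation from the talk, and strings are plain Python `str` objects over the alphabet {0, 1}.

```python
from collections import Counter
from itertools import combinations


def substring_profile(x: str, L: int) -> Counter:
    """Multiset S_L(x) of all length-L substrings (contiguous windows) of x."""
    return Counter(x[i:i + L] for i in range(len(x) - L + 1))


def subsequence_multiset(x: str, k: int) -> Counter:
    """Multiset K_k(x) of all length-k subsequences, one per set of k positions."""
    return Counter(
        "".join(x[i] for i in idx) for idx in combinations(range(len(x)), k)
    )


# Substring example: same 3-profile, different 4-profile.
same_S3 = substring_profile("10010", 3) == substring_profile("00100", 3)
same_S4 = substring_profile("10010", 4) == substring_profile("00100", 4)

# Subsequence example: same K_2, different K_3.
same_K2 = subsequence_multiset("1001", 2) == subsequence_multiset("0110", 2)
same_K3 = subsequence_multiset("1001", 3) == subsequence_multiset("0110", 3)

print(same_S3, same_S4, same_K2, same_K3)  # True False True False
```

The subsequence computation enumerates all k-element position sets, so it is only practical for short strings.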
Multiomics Data: Problems in String Reconstruction
- 1. Protein Sequencing and String Reconstruction:
1.1 When is a string x ∈ {0, 1}^n uniquely reconstructable from its multiset of substring compositions M(x)?
1.2 Example 1: For x = 010, M(x) = {0, 0, 1, 0^1 1^1, 0^1 1^1, 0^2 1^1}. We know the composition of all substrings, but not the exact order of the symbols in the substrings. Clearly, M(001) = M(100), and this is true of all pairs of strings and their reversals.
1.3 Based on the assumption that “mass” may be equated with “composition”.
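The composition multiset in the example can be reproduced with a short Python sketch. It assumes that a composition is recorded as a pair (number of 0s, number of 1s), so that 0^2 1^1 becomes (2, 1); the function name `composition_multiset` is a hypothetical helper, not something defined in the talk.

```python
from collections import Counter


def composition_multiset(x: str) -> Counter:
    """Multiset M(x): the (zeros, ones) count of every contiguous substring of x."""
    out = Counter()
    n = len(x)
    for i in range(n):
        zeros = ones = 0
        # Extend the substring x[i..j] one symbol at a time.
        for j in range(i, n):
            if x[j] == "0":
                zeros += 1
            else:
                ones += 1
            out[(zeros, ones)] += 1
    return out


m = composition_multiset("010")
print(m[(1, 0)], m[(0, 1)], m[(1, 1)], m[(2, 1)])  # 2 1 2 1

# A string and its reversal always share the same composition multiset.
print(composition_multiset("001") == composition_multiset("100"))  # True
```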
Multiomics Data and Macromolecular Storage
DNA, proteins and other macromolecules proposed for use as storage media.
- 1. May use encoding to make strings both resilient to errors and to allow for
unique reconstruction!
- 2. May use encoding for “combined” sequencing techniques to allow for
unique reconstruction!
Talk Outline
A string x ∈ {0, 1}^n is said to be uniquely reconstructable from an evidence set of strings E(x) (E ∈ {S, K, M}) if no other string y ≠ x shares the same evidence set.
- 1. Describe under which conditions unique reconstruction is possible.
- 2. Introduce new coding schemes that make strings uniquely reconstructable.
- 3. Use coding theory arguments to show that strings are uniquely reconstructable.
Substrings
Reconstructing Strings from Substring Profiles
As before, let x ∈ {0, 1}^n. We say that x is L-substring uniquely reconstructable if no y ∈ {0, 1}^n with y ≠ x has the same L-profile.
- 1. Reconstruction based on substring profiles. Basic problems.
- 2. What is the smallest length L that allows for reconstructing (almost) all
strings of length n uniquely?
- 3. How many strings y ≠ x of length n have the same profile as the sequence x? Denote this number by T_x.
- 4. What is the number of different profiles N_L for given values of n and L?
Reconstructing Strings from Substring Profiles
- 1. Problem introduced by Ukkonen, 1992, in the context of pattern matching. Important observation: a string is L-substring uniquely reconstructable if each (L − 1)-substring occurs at most once.
- 2. Margaritis and Skiena, 1995, showed that unique reconstruction properties depend on the period of the string (a string x has period p if x_i = x_{i+p} for all 1 ≤ i ≤ n − p). Unique L-substring reconstruction is impossible for strings with p ≤ L. Otherwise, L ≥ ⌊n/2⌋ + 1 suffices.
- 3. Example: S_4(0111011) = S_4(1110111) = {0111, 1110, 1011, 1101}.
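The period condition and the example can be checked numerically. The sketch below is a brute-force illustration under the stated definition of a period (0-based indexing in the code); `smallest_period`, `profile`, and `has_unique_substrings` are hypothetical helper names, and the convention of returning n when no shorter period exists is an assumption of this sketch.

```python
from collections import Counter


def smallest_period(x: str) -> int:
    """Smallest p >= 1 with x[i] == x[i + p] for all valid i (n if none is shorter)."""
    n = len(x)
    for p in range(1, n):
        if all(x[i] == x[i + p] for i in range(n - p)):
            return p
    return n


def profile(x: str, L: int) -> Counter:
    """Multiset of length-L substrings of x."""
    return Counter(x[i:i + L] for i in range(len(x) - L + 1))


def has_unique_substrings(x: str, m: int) -> bool:
    """True if every length-m substring of x occurs at most once."""
    return all(count == 1 for count in profile(x, m).values())


x, y = "0111011", "1110111"
print(smallest_period(x), smallest_period(y))  # 4 4
print(profile(x, 4) == profile(y, 4))          # True
print(has_unique_substrings(x, 3))             # False, 011 occurs twice
```

Both strings have period 4 = L, which is consistent with the impossibility statement for p ≤ L.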
Reconstructing Strings from Substring Profiles
- 1. Information-theoretic approaches: Davisson, Longo, Sgarro, 1981, in the context of deriving error exponents for noiseless encoding of Markov sources, showed that the number of strings T_x that share the 2-profile (L = 2) of x satisfies
\frac{1}{n^2 (n + 1)^4} exp_2(n H(Φ(x) | F(x))) ≤ T_x ≤ 2 exp_2(n H(Φ(x) | F(x))).
- 2. Counting arguments: Szpankowski et al., 2015, derived bounds for N_L, the number of different L-profiles, when L = 2.
- 3. Enumeration of lattice points in polytopes and Ehrhart theory: The general case L ≥ 2 was analyzed by Kiah, Puleo, M, 2016, establishing that for constant L one has N_L ∼ c(L) n^{2^L − 2^{L−1}}.
Main Results: Coding for Unique Reconstruction
Let C ⊂ {0, 1}^n. The code C is L-reconstructable if every code string c ∈ C is uniquely reconstructable based on S_L(c). What is the maximal size of the code C when L is allowed to grow with n?
- 1. Kiah et al., 2017 showed that codes of rate one exist provided that L = 2 log n + const. The code rate is zero for L < log n, which follows from N_L ≤ \binom{n − L + 2^L}{2^L − 1}.
- 2. Proofs are nonconstructive.
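The binomial bound on N_L can be compared against an exhaustive count for small parameters. The sketch below is an illustration only: it enumerates all 2^n binary strings, so it is feasible for small n, and the names `count_profiles`, `binomial_bound`, and `rate_upper_bound` are hypothetical. It relies on the observation that an L-reconstructable code contains at most one string per profile, so log_2(N_L)/n upper-bounds its rate.

```python
from collections import Counter
from itertools import product
from math import comb, log2


def count_profiles(n: int, L: int) -> int:
    """Number N_L of distinct L-profiles over all binary strings of length n."""
    profiles = set()
    for bits in product("01", repeat=n):
        x = "".join(bits)
        prof = Counter(x[i:i + L] for i in range(n - L + 1))
        profiles.add(frozenset(prof.items()))  # hashable form of the multiset
    return len(profiles)


def binomial_bound(n: int, L: int) -> int:
    """Upper bound on N_L: binomial(n - L + 2^L, 2^L - 1)."""
    return comb(n - L + 2 ** L, 2 ** L - 1)


def rate_upper_bound(n: int, L: int) -> float:
    """log2 of the binomial bound divided by n."""
    return log2(binomial_bound(n, L)) / n


for n in range(4, 11):
    print(n, count_profiles(n, 2), binomial_bound(n, 2))

# For fixed L the bound is polynomial in n, so the rate bound shrinks with n.
print(rate_upper_bound(100, 3), rate_upper_bound(10_000, 3))
```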
Main Results: Coding for Unique Reconstruction
Let C ⊂ {0, 1}^n. The code C is L-reconstructable if every code string c ∈ C is uniquely reconstructable based on S_L(c). What is the maximal size of the code C when L grows with n?
- 1. Gabrys and M, 2018, showed that codes of rate one exist provided that L > log n. For L = 2 log n + 2, there exists an L-reconstructable code with one bit of redundancy. For L = 2 log n + 5, there exist codes with two bits of redundancy.
- 2. For log n < L < 2 log n, and all other parameter regimes, we have explicit
(new) constructions based on repeat replacement.
Lessons Learned
- 1. If each substring of length L − 1 in a code string is unique, then the code string is L-reconstructable.
- 2. Example: Let x = 01100101 and L = 4, so that S_4(x) = {0110, 1100, 1001, 0010, 0101}. The string is 3-substring unique, as S_3(x) = {011, 110, 100, 001, 010, 101}. Choose a 4-substring that contains a 3-string that appears only once as a substring in the strings of S_4(x) (e.g., 011). This is the initial substring. Then overlap suffix-to-prefix, as in de Bruijn graphs: 0110 → 1100 → 1001 ...
- 3. Eliminate repeated substrings of length L − 1. How does one transform an information string into a code string that does not have repeated (L − 1)-substrings?
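The suffix-to-prefix overlap in the example can be sketched in Python as follows. This is a minimal illustration that assumes the input really is the L-profile of a string whose (L − 1)-substrings are all distinct and that L ≥ 2; it is not presented as the reconstruction procedure used in the talk, and `reconstruct_from_profile` is a hypothetical name.

```python
def reconstruct_from_profile(profile) -> str:
    """Rebuild x from its set of L-substrings when all (L-1)-substrings of x are unique."""
    kmers = set(profile)
    L = len(next(iter(kmers)))
    by_prefix = {s[:-1]: s for s in kmers}   # (L-1)-prefix -> the L-substring owning it
    suffixes = {s[1:] for s in kmers}
    # The initial substring is the one whose prefix never appears as a suffix.
    starts = [s for s in kmers if s[:-1] not in suffixes]
    if len(by_prefix) != len(kmers) or len(starts) != 1:
        raise ValueError("profile is not from an (L-1)-substring unique string")
    x = starts[0]
    for _ in range(len(kmers) - 1):
        nxt = by_prefix.get(x[-(L - 1):])    # L-substring overlapping the current suffix
        if nxt is None:
            raise ValueError("overlap chain is broken")
        x += nxt[-1]
    return x


S4 = {"0110", "1100", "1001", "0010", "0101"}
print(reconstruct_from_profile(S4))  # 01100101
```

When some (L − 1)-substring repeats, as in S_4(0111011), no unique initial substring exists and the sketch raises an error instead of guessing.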
Why not Compression?
- 1. Lempel-Ziv type of encoding?
- 2. No guarantee that it will remove all repeats of length L − 1, even when
using “multiple passes.”
The Inspiration
- 1. Wijngaarden and Immink, 2010, constructed maximum run-length limited
codes based on a new technique called runlength replacement. We developed a new repeat replacement technique.
- 2. Key idea is to remove “short” repeated strings and insert metadata that
allows for retrieving original string.
Runlength Replacement
- 1. Let the information string be 00000001001110, and impose a max zero runlength constraint r_max = 5. Focus on runlengths of length ≥ r_max + 1.
- 2. Append the substring 1 to the information string. This leads to the string 000000010011101.
- 3. Delete, starting from the left, runlengths of length exactly six. In the first step, this results in 010011101.
- 4. Append B(i) to the modified string, where B(i) ∈ {0, 1}^{log n} denotes the binary representation of the location integer i, and log n is shorthand for ⌈log_2 n⌉.
- 5. Since i = 0 and log n = 4, we have B(0) = 0^4, which gives 0100111010000.
- 6. Next, append 10 to arrive at the code string 010011101000010. The string 10 indicates that the substring immediately preceding it describes the location of a deleted zero runlength of length six.
- 7. Typical repeat length ∼ log n, and log n bits are needed for B(i). The code has one redundant bit for the correct parameter choice n ≤ 2^{k−1} + k + 1.
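The worked example above can be traced with a short Python sketch of a single replacement round. It covers only what the example shows: one deletion of r_max + 1 zeros, followed by the pointer B(i) and the flag 10. The helper names are hypothetical, and the inverse assumes r_max + 1 = ⌈log_2 n⌉ + 2, so that the length stays at n as it does in the example; the full iterative scheme is not reproduced here.

```python
from math import ceil, log2


def runlength_replace_once(info: str, r_max: int) -> str:
    """One round: append 1, delete the leftmost run of r_max + 1 zeros, add B(i) and 10."""
    x = info + "1"
    width = ceil(log2(len(x)))           # number of bits in the pointer B(i)
    i = x.find("0" * (r_max + 1))        # leftmost window of r_max + 1 zeros
    if i < 0:
        return x                         # nothing to replace; the final 1 marks this case
    shortened = x[:i] + x[i + r_max + 1:]
    return shortened + format(i, f"0{width}b") + "10"


def undo_once(code: str, r_max: int) -> str:
    """Inverse of one round, assuming r_max + 1 == ceil(log2(n)) + 2."""
    width = ceil(log2(len(code)))
    if code.endswith("1"):
        return code[:-1]                 # no replacement happened
    body, pointer = code[:-(width + 2)], code[-(width + 2):-2]
    i = int(pointer, 2)
    restored = body[:i] + "0" * (r_max + 1) + body[i:]
    return restored[:-1]                 # drop the appended 1


code = runlength_replace_once("00000001001110", r_max=5)
print(code)                      # 010011101000010
print(undo_once(code, r_max=5))  # 00000001001110
```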
What About Repeat Replacement?
- 1. All input sequences x ∈ {0, 1}^{n−2} are required to satisfy x_{L−1} = 1 (see *).
- 2. Let L = 7, x = 10111110000010111110000110000000; first repeat of
length seven when scanned from left is 1011111.
- 3. Repeat removal (naive approach):
1011111000001011111000010000000 → 101111100000000010000000 → 101111100000000010000000B(0)B(12)01.
- 4. Note that len(B(i)B(i′)01) = 2 log n + 2, and L ∼ 2 log n.
- 5. Unlike runlength removal, where shortening a runlength of zeros does not
produce a new runlength, removing a repeat may produce a new repeat.
- 6. (*) Can repeat the removal iteratively, but how do we guarantee that the
procedure terminates?
- 7. (*) Specialized procedure that ensures that at each round, the length of
the input string is reduced by exactly one. Works if we remove L − 1 and add L − 2 = 2 log n + 2 bits back.
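The naive repeat removal in item 3 can be sketched in Python. The sketch uses the input string exactly as written at the start of the arrow chain in item 3 and 0-based positions; the pointer width `width` is a hypothetical parameter, since the slides leave B(·) symbolic, and the function names are illustrative. It performs a single removal and does not address the termination question raised in items 5 to 7.

```python
def first_repeat(x: str, m: int):
    """Return (i, j), i < j, for the first length-m window at j that also occurs at i."""
    seen = {}
    for j in range(len(x) - m + 1):
        window = x[j:j + m]
        if window in seen:
            return seen[window], j
        seen[window] = j
    return None


def naive_repeat_removal(x: str, m: int, width: int) -> str:
    """Delete the second copy of the first repeat and append B(i) B(j) 01."""
    found = first_repeat(x, m)
    if found is None:
        return x
    i, j = found
    body = x[:j] + x[j + m:]
    return body + format(i, f"0{width}b") + format(j, f"0{width}b") + "01"


x = "1011111000001011111000010000000"
print(first_repeat(x, 7))                         # (0, 12)
print(naive_repeat_removal(x, 7, width=5)[:24])   # 101111100000000010000000
```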
The Algorithm
- 1. Repeat replacement encoder E_RR for generating (L − 1)-substring unique strings.
- 2. If x_I is (L − 1)-substring unique, set x = x_I and STOP. Otherwise, set x^{(0)} = x_I, and let k = 1.
- 3. Suppose that x^{(k−1)} has a repeat at positions (i, j) of length L − 1. Let x^{(k)} be obtained by deleting x^{(k−1)}_{j,L−1} from x^{(k−1)} and appending B(i)B(j)10 to the end of the generated string.
- 4. If x^{(k)}_{L−1} = 0, i.e., if the (L − 1)-st bit of x^{(k)} equals 0, reset x^{(k)}_{L−1} = 1 and update the last two bits of x^{(k)} to 11. If x^{(k)} is (L − 1)-substring unique, set x = x^{(k)} and STOP. Otherwise, set k = k + 1 and return to the repeat-removal step (item 3).
The Algorithm
- 1. Encoder E_LR for an L-reconstruction code.
- 2. Let x_I ∈ {0, 1}^{n−2}. If the (L − 1)-th bit of x_I is 1, append 10. Otherwise, set the (L − 1)-th bit of x_I to 1, and append 11.
- 3. If ℓ(E_RR(x_I)) = n, set x = E_RR(x_I). Otherwise, append to E_RR(x_I) as many zeros as needed to make the string x have length n.
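The flag-bit step in item 2 is simple enough to sketch on its own. The Python below covers only that step and its inverse; it does not call E_RR or perform the zero padding of item 3. The function names are hypothetical, and bits are counted from 1 as on the slide, so the (L − 1)-th bit sits at 0-based index L − 2.

```python
def set_flag_bit(x_info: str, L: int) -> str:
    """Force the (L-1)-th bit to 1 and record in two appended bits whether it changed."""
    pos = L - 2                      # 0-based index of the (L-1)-th bit
    if x_info[pos] == "1":
        return x_info + "10"         # bit was already 1
    return x_info[:pos] + "1" + x_info[pos + 1:] + "11"   # bit was flipped from 0


def undo_flag_bit(y: str, L: int) -> str:
    """Recover the original string from the output of set_flag_bit."""
    pos = L - 2
    body, flag = y[:-2], y[-2:]
    if flag == "11":
        body = body[:pos] + "0" + body[pos + 1:]
    return body


print(set_flag_bit("0000000", L=4))   # 001000011
print(set_flag_bit("0010000", L=4))   # 001000010
print(undo_flag_bit("001000011", L=4))  # 0000000
```

The two appended bits are what distinguishes the two inputs above, which would otherwise map to the same string.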
Repeat Replacement for log n < L < 2 log n
- 1. Cannot afford encoding two positions for the two repeats.
- 2. Can we use only the encoding of the position of the original (first) repeat?
- 3. Short markers at positions of repeats of length ∼ log log n. But how do we
distinguish markers from information content and other markers?
- 4. Perform runlength coding to remove runlengths exceeding length
∼ log log n. Then use zero-runs of that length+1 as markers.
Additional Results
- 1. Repeat replacement in the presence of erroneous and missing substrings.
- 2. Interesting new technique for ensuring code strings have substrings of length $\hat{L}$ at sufficiently large Hamming distance and of sufficiently large weight.
- 3. For details, see: Kiah, Puleo, M, IT 2016, DNA Profile Codes, Gabrys, M, ISIT 2018, and arXiv preprint Unique Reconstruction of Coded Strings from Multiset Substring Spectra at https://arxiv.org/abs/1804.04548
Subsequences
The k-Deck Problem
- 1. The k-deck for $x \in \{0, 1\}^n$ is the multiset of all subsequences of x of length k.
- 2. The k-deck problem asks one to determine the minimum value of k, denoted by $f(n)$, necessary to uniquely reconstruct any $x \in \{0, 1\}^n$.
- 3. Kalashnik, 1973, introduced the k-deck problem.
- 4. Krasikov and Roditty, 1997, showed that $f(n) \le (1 + o(1)) \frac{16}{7} \sqrt{n}$.
- 5. Dudik and Schulman, 2003, proved that $f(n) \ge \exp(\Omega(\log^{1/2} n))$.
- 6. Scott, 2011, proved that $f(n) \le (1 + o(1)) \sqrt{n \log n}$.
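For very small n, the definitions above can be explored by brute force. The sketch below assumes strings are represented as tuples of bits; the names `k_deck` and `min_k` are illustrative, and the exhaustive search over all 2^n strings is only meant for toy sizes.

```python
from collections import Counter
from itertools import combinations, product


def k_deck(x, k):
    """Multiset of all length-k subsequences of x, as a Counter of tuples."""
    return Counter(combinations(x, k))


def min_k(n):
    """Smallest k such that all binary strings of length n have distinct k-decks."""
    strings = list(product((0, 1), repeat=n))
    for k in range(1, n + 1):
        decks = {frozenset(k_deck(x, k).items()) for x in strings}
        if len(decks) == len(strings):
            return k
    return n  # not reached: the n-deck is the string itself
```

For example, `min_k(4)` evaluates to 3, since 0110 and 1001 share the same 2-deck but no two strings of length 4 share a 3-deck.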
The Trace Reconstruction Problem
- 1. One is given a collection of m random subsequences (traces) of a string of length n, obtained by an iid deletion process with probability q. When is unique reconstruction possible?
- 2. Problem: determine achievable (m, q) pairs for a given n.
- 3. Introduced by Batu et al, 2004, who showed that:
3.1 For random strings, $(m \sim \log n,\ q \sim \frac{1}{\log n})$ is achievable. Since improved to $(m = \exp(O(\log^{1/3} n)),\ q < \frac{1}{2})$ by Holden et al, 2018.
3.2 For arbitrary strings, $(m = O(n \log n),\ q = \frac{1}{n^{1/2+\epsilon}})$, $\epsilon > 0$, is achievable. Since improved to $(m = \exp(O(n^{1/3})),\ q < \frac{1}{2})$ by De et al. and Nazarov and Peres, 2018.
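The trace model in item 1 can be simulated with a few lines of Python. This is a minimal sketch, assuming the string is a list of bits; the names `sample_trace` and `sample_traces` and the `seed` parameter are choices made here for reproducibility, not part of the problem statement.

```python
import random


def sample_trace(x, q, rng):
    """Pass x through an iid deletion channel: each bit is deleted with probability q."""
    return [b for b in x if rng.random() >= q]


def sample_traces(x, m, q, seed=0):
    """Draw m independent traces of x with deletion probability q."""
    rng = random.Random(seed)
    return [sample_trace(x, q, rng) for _ in range(m)]
```

Each trace is a subsequence of x whose length is random, with expected value (1 − q) n.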
The Simplest Trace Reconstruction Algorithm
- 1. Align sequence on the left. Find majority bit value at the first position. [Figure: left-aligned traces.]
- 2. Shift traces in disagreement with the majority to the right. Find majority bit value at the second position. Break ties arbitrarily. [Figure: traces after the shift.]
- 3. Iterate.
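The three steps above can be sketched as a pointer-based majority vote. This is an illustrative reading of the slide, not a reference implementation: the name `majority_align` is hypothetical, the length n of the original string is assumed known, ties are broken toward 1, and "shifting a trace to the right" is modeled by not advancing that trace's pointer.

```python
def majority_align(traces, n):
    """Reconstruct up to n bits by repeated majority voting over aligned traces."""
    pointers = [0] * len(traces)
    out = []
    for _ in range(n):
        # Bits currently under each pointer (exhausted traces do not vote).
        votes = [t[p] for t, p in zip(traces, pointers) if p < len(t)]
        if not votes:
            break
        bit = 1 if 2 * sum(votes) >= len(votes) else 0  # tie broken toward 1
        out.append(bit)
        # Traces that agree consume their bit; disagreeing traces are assumed
        # to have lost this bit and stay in place (they are "shifted right").
        for idx, (t, p) in enumerate(zip(traces, pointers)):
            if p < len(t) and t[p] == bit:
                pointers[idx] += 1
    return out
```

For instance, with traces 10110, 1110 and 10110 of the string 10110, the second trace disagrees at position two, is held back once, and the vote returns 10110.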
Coded k-Deck and Trace Reconstruction?
- 1. Levenshtein, 1992, introduced an instance of a coded trace reconstruction problem.
- 2. Empirical evidence that string balancing helps reduce the number of traces and the value of k needed for reconstruction (Yazdi, Gabrys, M, 2016); current collaboration with the Cheraghchi group, Imperial College.
- 3. Sala and Dolecek, 2016, showed that if x is a codeword of an $(\ell - 1)$-insertion/deletion-correcting code, and each channel introduces at most $t \ge \ell$ errors, then one needs at most
$$\sum_{j=\ell}^{t} \sum_{i=0}^{t-j} (-1)^{t+j-i} \binom{2j}{j} \binom{t+j-i}{2j} \binom{n+t}{i}$$
traces for exact reconstruction.
- 4. Nothing known about coded k-deck reconstruction (yet)!
A Practical Problem
- 1. With nanopore sequencers, long traces are produced early during the
sequencing process. After a certain point, the pores “tire” and only fairly short traces may be retrieved.
- 2. How long should one run the sequencer so that unique reconstruction is
possible?
Mathematical Formulation
Wish to reconstruct a coded $x \in \{0, 1\}^n$ given the following:
- 1. A set $U \subseteq \{0, 1\}^{n-t}$ of M subsequences of x obtained by deleting up to t zeros, and
- 2. The multiset of all length-k subsequences (the k-deck) of x.
The set U represents the long traces while the k-deck is the set of short traces. For a given n, M and t, the minimum value of k for unique reconstruction is denoted by $f(n, t, M)$.
Results: M=1
- 1. For $t = 1$, $f(n, 1, 1) = 2$.
1.1 For $t = o(n)$, $f(n, t, 1) \le t + 1$ and $f(n, t, 1) \ge \log t + 2$.
1.1.1 For $t \le 4$, the upper bound is tight.
1.2 For $t = n^{\epsilon}$, $\epsilon > 0$, $\exp(\Omega(\log^{1/2} n)) \le f(n, t, 1) \le O(\sqrt{n})$.
1.2.1 The value of k depends on the Hamming weight of the vector x.
- 2. For larger values of M, derived upper and lower bounds on $f(n, t, M)$ in terms of $f(n, t, 1)$.
An Example
- 1. Suppose that x = (1, 1, 1, 0).
- 2. Suppose that y = (1, 1, 1) is the “long trace”.
- 3. The 2-deck for x is:
X = {(1, 1), (1, 1), (1, 0), (1, 1), (1, 0), (1, 0)}.
- 4. The 2-deck for y is:
Y = {(1, 1), (1, 1), (1, 1)}.
- 5. Thus,
X ∖ Y = {(1, 0), (1, 0), (1, 0)}.
- 6. Since the subsequence (1, 0) appears 3 times in X ∖ Y, it follows that a
zero should be inserted after the third one in y to obtain x.
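The example can be replayed in Python using multiset subtraction. The sketch assumes, as in the formulation above, that y is obtained from x by deleting a single zero; the names `k_deck`, `ones_before_deleted_zero` and `insert_zero` are illustrative.

```python
from collections import Counter
from itertools import combinations


def k_deck(x, k):
    """Multiset of all length-k subsequences of x."""
    return Counter(combinations(x, k))


def ones_before_deleted_zero(deck_x, y):
    """Multiplicity of (1, 0) in K_2(x) minus K_2(y): the number of ones
    that precede the deleted zero."""
    diff = deck_x - k_deck(y, 2)  # Counter subtraction is multiset difference
    return diff[(1, 0)]


def insert_zero(y, ones_before):
    """Insert a zero into y immediately after its first `ones_before` ones."""
    pos, seen = 0, 0
    while seen < ones_before:
        seen += y[pos]
        pos += 1
    return y[:pos] + (0,) + y[pos:]
```

For x = (1, 1, 1, 0) and y = (1, 1, 1), the difference contains (1, 0) three times, so the zero is placed after the third one.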
Upper Bound, t = 1
- 1. Inspired by a beautiful paper by Scott, 2009, on k-deck reconstruction and reconstruction over $\mathbb{Z}_n$.
- 2. Suppose that $x \in \{0, 1\}^n$.
- 3. Let $n_i$ denote the number of subsequences of x of length i that end with a one. Then, $n_i = \sum_{j=1}^{n} \binom{j-1}{i-1} x_j$, for $i \in \{1, 2\}$.
- 4. Note that $n_1 = \sum_{j=1}^{n} x_j$ and $n_2 = \sum_{j=1}^{n} (j - 1) x_j$. Let $VT(x) = n_1 + n_2 = \sum_{j=1}^{n} j x_j$, and let $a = VT(x) \bmod (n + 1)$.
- 5. Then, $x \in C(n, a)$ where $C(n, a) = \{x : \sum_{j=1}^{n} j x_j \equiv a \bmod (n + 1)\}$, which is a VT single-deletion-correcting code.
- 6. Thus, one can correct one error in the long trace and $f(n, 1, 1) = 2$.
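Steps 3 to 6 can be checked numerically with the sketch below. It is an illustration under assumptions rather than the authors' decoder: the names `vt_from_2deck` and `recover` are hypothetical, $n_1$ is read off the 2-deck by noting that each bit of x appears in exactly n − 1 pairs, and decoding is done by brute force over all single-bit insertions into y instead of a dedicated VT decoder.

```python
from collections import Counter
from itertools import combinations


def vt_from_2deck(deck, n):
    """Compute VT(x) = n_1 + n_2 from the 2-deck of a length-n string x (n >= 2)."""
    # Every 1 in x occurs in n - 1 of the pairs, so the total count of ones
    # over all pairs equals (n - 1) * n_1.
    total_ones = sum(sum(pair) * count for pair, count in deck.items())
    n1 = total_ones // (n - 1)
    # n_2: number of length-2 subsequences that end with a one.
    n2 = sum(count for pair, count in deck.items() if pair[1] == 1)
    return n1 + n2


def recover(y, deck):
    """Return all strings obtained from y by one insertion whose VT syndrome
    matches the one implied by the 2-deck (a single string for a VT code)."""
    n = len(y) + 1
    a = vt_from_2deck(deck, n) % (n + 1)
    candidates = {y[:p] + (b,) + y[p:] for p in range(n) for b in (0, 1)}
    return [c for c in candidates
            if sum(j * bit for j, bit in enumerate(c, start=1)) % (n + 1) == a]
```

Because C(n, a) corrects a single deletion, exactly one candidate survives the syndrome check, whether the deleted bit was a zero or a one.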
Generalizing the Upper Bound for t ≥ 2
- 1. Can generalize the approach for t = 1 by keeping track of the number of subsequences in $K_k(x)$ and $K_k(y)$ of the following form: 111 . . . 10, where runlength depends on t.
- 2. Showed using Newton’s identities that it is possible to recover the locations of the deletions in x that lead to y using the above counts.
- 3. Then, given these locations and the long trace $y \in \{0, 1\}^{n-t}$, it is possible to recover x from y, provided that $k \ge t + 1$.
- 4. For positive integers $n \ge 2$ and $t < n$, one has $f(n, t, 1) \le t + 1$.
Lower Bounds: Some Notation
- 1. For any $x \in \{0, 1\}^n$, let $D_t(x)$ be the set of subsequences of length $n - t$ that may be obtained by deleting zeros from x.
1.1 If x = (0, 1, 1, 0, 1), then $D_1(x) = \{(1, 1, 0, 1), (0, 1, 1, 1)\}$.
- 2. For any $x \in \{0, 1\}^n$, let $I_t(x)$ be the set of supersequences of length $n + t$ that may be obtained by inserting zeros into x.
2.1 If x = (0, 1), then $I_1(x) = \{(0, 0, 1), (0, 1, 0)\}$.
- 3. As before, let $K_k(x)$ denote the k-deck of x.
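The two sets can be enumerated directly for short strings. The following is a small sketch with hypothetical names `delete_zeros` and `insert_zeros`, assuming strings are tuples of bits; it reproduces the two examples above.

```python
from itertools import combinations


def delete_zeros(x, t):
    """D_t(x): all strings of length len(x) - t obtained by deleting t zeros."""
    zero_positions = [i for i, b in enumerate(x) if b == 0]
    return {
        tuple(b for i, b in enumerate(x) if i not in set(drop))
        for drop in combinations(zero_positions, t)
    }


def insert_zeros(x, t):
    """I_t(x): all strings of length len(x) + t obtained by inserting t zeros."""
    total = len(x) + t
    result = set()
    # Choose which positions of the supersequence hold the inserted zeros;
    # the remaining positions are filled with the bits of x in order.
    for slots in combinations(range(total), t):
        chosen = set(slots)
        bits = iter(x)
        result.add(tuple(0 if p in chosen else next(bits) for p in range(total)))
    return result
```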
The Lower Bound
- 1. Let $x^{(1)} = 01$ and $y^{(1)} = 10$.
1.1 Note that $K_1(x^{(1)}) = K_1(y^{(1)})$ and that $1 \in D_1(x^{(1)}) \cap D_1(y^{(1)})$, which implies $f(n, 1, 1) > 1$ for $n \ge 2$.
- 2. Consider the following strings: $x^{(2)} = x^{(1)} y^{(1)}$, $y^{(2)} = y^{(1)} x^{(1)}$.
- 3. Note that $K_2(x^{(2)}) = K_2(y^{(2)})$ and that $11 \in D_2(x^{(2)}) \cap D_2(y^{(2)})$, which implies that $f(n, 2, 1) > 2$ for $n \ge 4$.
- 4. Iterating the above procedure, one gets that for $t \le \frac{n}{2}$, $f(n, t, 1) \ge \log t + 2$.
- 5. The sequences above are known as Morse-Thue (fractal) strings.
- 6. For $t \le 4$ and $n \ge 2t$, $f(n, t, 1) = t + 1$.
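The recursive construction can be verified for small levels with a short script. This is a sketch with illustrative names (`morse_thue_pair`, `k_deck`); it builds $x^{(m)}$ and $y^{(m)}$ by the concatenation rule of item 2 and lets one compare their m-decks and their all-ones subsequences.

```python
from collections import Counter
from itertools import combinations


def morse_thue_pair(m):
    """Return (x^(m), y^(m)), where x^(1) = 01, y^(1) = 10 and
    x^(i+1) = x^(i) y^(i), y^(i+1) = y^(i) x^(i)."""
    x, y = (0, 1), (1, 0)
    for _ in range(m - 1):
        x, y = x + y, y + x
    return x, y


def k_deck(x, k):
    """Multiset of all length-k subsequences of x."""
    return Counter(combinations(x, k))
```

At level m the two strings have length $2^m$, share the same m-deck, and both reduce to the all-ones string of length $2^{m-1}$ once their $t = 2^{m-1}$ zeros are deleted, which is the situation behind the bound in item 4.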
An Example: Handling M > 1
- 1. Let $R(x)$ be an integer-valued sequence that counts the number of 0s between consecutive 1s in x, including the 0s before the first 1 and after the last 1.
- 2. For x = 10110010, one has $R(x) = 01021$.
- 3. Suppose that t = 2, M = 2 and that $U = \{111010, 101101\} = \{y^{(1)}, y^{(2)}\}$. Then, $R^{(1)} = R(y^{(1)}) = 00011$, $R^{(2)} = R(y^{(2)}) = 01010$.
- 4. Define $Y = (Y_1, \ldots, Y_5)$, where $Y_i = \max\{R^{(1)}_i, R^{(2)}_i\}$. Then, $Y = 01011$ and we let y be the sequence for which $R(y) = Y$, y = 1011010.
- 5. Note that $y \in D_1(x)$ and we can recover x from $y \in D_1(x)$ and $K_k(x)$ if $k = f(n, 1, 1)$, which implies $f(n, 2, 2) \le f(n, 1, 1) \le 2$.
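The merging step of this example can be reproduced with the sketch below. The function names `zero_runs`, `from_zero_runs` and `merge_runs` are hypothetical, strings are given as text over the characters 0 and 1, and the traces are assumed to differ from x only by deleted zeros, so that all run vectors have the same length.

```python
def zero_runs(x: str):
    """R(x): numbers of 0s before the first 1, between consecutive 1s,
    and after the last 1."""
    runs, count = [], 0
    for bit in x:
        if bit == "1":
            runs.append(count)
            count = 0
        else:
            count += 1
    runs.append(count)
    return runs


def from_zero_runs(runs):
    """Inverse of zero_runs: rebuild the string from its run vector."""
    return "1".join("0" * r for r in runs)


def merge_runs(traces):
    """Entrywise maximum of the run vectors of all traces."""
    return [max(column) for column in zip(*(zero_runs(t) for t in traces))]
```

Applied to U = {111010, 101101}, `merge_runs` gives the vector 01011, and `from_zero_runs` turns it back into 1011010.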
The Proof Technique
- 1. The approach is to find a sequence $\bar{y}$ such that for any $y \in U$, $y \in D_v(y)$ for some $v \le t$.
- 2. Clearly, this leads to an upper bound on the quantity $f(n, t, M)$.
- 3. A similar approach also gives a lower bound on the quantity $f(n, t, M)$.
- 4. Key claim. Let $x, y \in \{0, 1\}^n$ be such that there exists a
$w \in I_t(x) \cap I_t(y)$. Suppose $z \in I_{t_0}(x) \cap I_{t_0}(y)$ for some $t_0 < t$. Then $w \in I_{t - t_0}(z)$.
- 5. Using this claim along with counting arguments, one can obtain the claimed
upper and lower bounds for $f(n, t, M)$ as a function of $f(n, t)$.
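To make the notation concrete, here is a minimal brute-force sketch of the two sets that appear above. It assumes the usual reading, which these slides do not spell out: D_v(y) is the set of strings obtained from y by deleting exactly v symbols, and I_t(x) is the set of strings obtained from x by inserting exactly t symbols. The function names `deletion_ball` and `insertion_ball` are hypothetical, and the enumeration is only practical for short strings.

```python
from itertools import combinations


def deletion_ball(y, v):
    """Assumed D_v(y): all strings obtained from y by deleting exactly v symbols."""
    n = len(y)
    if v > n:
        return set()
    # Choose which n - v positions to keep, in their original order.
    return {"".join(y[i] for i in keep) for keep in combinations(range(n), n - v)}


def insertion_ball(x, t, alphabet="01"):
    """Assumed I_t(x): all strings obtained from x by inserting exactly t symbols."""
    ball = {x}
    for _ in range(t):
        # Insert one symbol at every possible position of every string so far.
        ball = {w[:i] + a + w[i:]
                for w in ball
                for i in range(len(w) + 1)
                for a in alphabet}
    return ball


if __name__ == "__main__":
    print(sorted(deletion_ball("0100", 1)))
    print(sorted(insertion_ball("00", 1)))
```

Under these definitions, w lies in I_t(x) exactly when x lies in D_t(w), so either function can be used to test membership in the other set.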
Additional Results
- 1. Analysis of strings encoded using deletion-correcting codes.
- 2. For details, see Gabrys, Milenkovic, ISIT 2017, and the arXiv preprint.
- 3. For practical applications, see Yazdi, Gabrys, Milenkovic, 2016, Nature SR.
Substring Composition
Substring Composition: Results
- 1. Introduced by Acharya et al., 2011, to analyze the
reconstruction of proteins from MS/MS data.
- 2. All strings of length ≤ 7, one less than a prime, or one less than twice a
prime, can be reconstructed uniquely up to reversal.
- 3. Combinatorial arguments are used to derive the number of strings with the
same substring composition whenever unique reconstruction is not possible.
- 4. Based on analyzing a new version of the turnpike problem:
Given a multiset $\Delta X$ of $\binom{n}{2}$ positive numbers, does there exist a set $X$
such that $\Delta X$ equals the multiset of all absolute values of pairwise differences of the elements of $X$?
- 5. Studied by Skiena et al., 1997-1998. One of the key results asserts the
following. Let $g(z) = \sum_{i=1}^{n} z^{a_i}$ be the generating function of a multiset $\{a_1, \ldots, a_n\}$.
If $Q(z)$ is the generating function of $\Delta X \cup (-\Delta X)$, and $P(z)$ is the generating function of $X$, then $Q(z) + n = P(z)\, P(1/z)$.
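The generating-function identity Q(z) + n = P(z) P(1/z) can be checked numerically on small point sets. The sketch below is illustrative only: it represents a Laurent polynomial as a `Counter` mapping exponents to coefficients, and the helper names (`difference_multiset`, `laurent_product`, `identity_holds`) are hypothetical rather than taken from the cited work.

```python
from collections import Counter
from itertools import combinations


def difference_multiset(X):
    """Multiset of |a - b| over all unordered pairs of elements of X."""
    return Counter(abs(a - b) for a, b in combinations(X, 2))


def laurent_product(P, Q):
    """Multiply two Laurent polynomials stored as {exponent: coefficient}."""
    out = Counter()
    for e1, c1 in P.items():
        for e2, c2 in Q.items():
            out[e1 + e2] += c1 * c2
    return out


def identity_holds(X):
    """Check Q(z) + n == P(z) * P(1/z) for the point set X."""
    n = len(X)
    P = Counter(X)                    # P(z)   = sum of z^x over x in X
    P_inv = Counter(-x for x in X)    # P(1/z) = sum of z^(-x) over x in X
    lhs = Counter({0: n})             # the constant term n
    for d, c in difference_multiset(X).items():
        lhs[d] += c                   # contribution of Delta X
        lhs[-d] += c                  # contribution of -Delta X
    return lhs == laurent_product(P, P_inv)


if __name__ == "__main__":
    print(identity_holds([0, 2, 5, 6]))
```

The identity also shows why the turnpike problem can have several solutions: two different sets with the same difference multiset give the same product P(z) P(1/z).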
Substring Composition: Results
- 1. Let $g(z) = \sum_{i=1}^{n} z^{a_i}$ be the generating function of a multiset $\{a_1, \ldots, a_n\}$.
If $Q(z)$ is the generating function of $\Delta X \cup (-\Delta X)$, and $P(z)$ is the generating function of $X$, then $Q(z) + n = P(z)\, P(1/z)$.
- 2. The MS/MS problem requires one to analyze different types of generating
functions, since each composition $0^{a_i} 1^{b_i}$ of a substring has to be described by a monomial $x^{a_i} y^{b_i}$. The sum of all such monomials equals $Q_s(x, y)$. If $P_s(x, y)$ denotes the composition polynomial of all prefixes of a given string $s$ of length $n$, where the prefix of length $i$ contains $a_i$ zeros,
$P_s(x, y) = \sum_{i=0}^{n} x^{a_i} y^{i - a_i}$
(e.g., $P_{0100}(x, y) = 1 + x + xy + x^2 y + x^3 y$), then $P_s(x, y)\, P_s(1/x, 1/y) = n + 1 + Q_s(x, y) + Q_s(1/x, 1/y)$.
- 3. Many open problems remain regarding coded reconstruction and reconstruction in
the presence of errors.
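The bivariate identity can be checked the same way. The sketch below assumes that P_s sums over prefixes of length 0 through n (so the empty prefix contributes the constant 1, as in the example for 0100) and that Q_s sums one monomial per nonempty substring; under those assumptions it verifies P_s(x, y) P_s(1/x, 1/y) = n + 1 + Q_s(x, y) + Q_s(1/x, 1/y). Polynomials are stored as `Counter` objects keyed by exponent pairs, and all function names are hypothetical.

```python
from collections import Counter


def prefix_polynomial(s):
    """P_s(x, y): one monomial x^a y^b per prefix (lengths 0..n), a = #0s, b = #1s."""
    P = Counter({(0, 0): 1})          # the empty prefix
    a = b = 0
    for ch in s:
        if ch == "0":
            a += 1
        else:
            b += 1
        P[(a, b)] += 1
    return P


def substring_polynomial(s):
    """Q_s(x, y): one monomial x^a y^b per nonempty substring of s."""
    Q = Counter()
    for i in range(len(s)):
        a = b = 0
        for j in range(i, len(s)):
            if s[j] == "0":
                a += 1
            else:
                b += 1
            Q[(a, b)] += 1
    return Q


def invert(P):
    """Substitute x -> 1/x and y -> 1/y."""
    return Counter({(-a, -b): c for (a, b), c in P.items()})


def multiply(P, Q):
    """Product of two bivariate Laurent polynomials."""
    out = Counter()
    for (a1, b1), c1 in P.items():
        for (a2, b2), c2 in Q.items():
            out[(a1 + a2, b1 + b2)] += c1 * c2
    return out


def identity_holds(s):
    """Check P_s * P_s(1/x, 1/y) == n + 1 + Q_s + Q_s(1/x, 1/y)."""
    P = prefix_polynomial(s)
    Q = substring_polynomial(s)
    rhs = Counter({(0, 0): len(s) + 1})
    rhs.update(Q)                     # Counter.update adds coefficients
    rhs.update(invert(Q))
    return multiply(P, invert(P)) == rhs


if __name__ == "__main__":
    print(sorted(prefix_polynomial("0100")))
    print(identity_holds("0100"))
```

Because a string and its reversal have the same substrings up to order, `substring_polynomial` returns the same result for both, which is why reconstruction can only be unique up to reversal.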
Additional Results
- 1. For detailed problem formulation and solutions, see Acharya, Das, Milenkovic, Orlitsky, Pan, String Reconstruction from Substring
Compositions, SIAM J. Discrete Math., 2015.
- 2. Coded version: work in progress.