String Reconstruction Problems in Multiomics Data Analysis

Olgica Milenkovic, joint work with Ryan Gabrys, University of Illinois,

Sections: Multiomics Data and String Reconstruction; Substrings; Subsequences; Substring Composition.


  1. Talk Outline
     A string $x \in \{0,1\}^n$ is said to be uniquely reconstructable from an evidence set of strings $E(x)$, $E \in \{S, K, M\}$, if no other string $y \neq x$ shares the same evidence set.
     1. Under which conditions is unique reconstruction possible?
     2. Can new coding schemes make strings uniquely reconstructable?
     3. Can coding-theoretic arguments show that strings are uniquely reconstructable?

  2. Substrings

  6. Reconstructing Strings from Substring Profiles
     As before, let $x \in \{0,1\}^n$. We say that $x$ is $L$-substring uniquely reconstructable if no $y \in \{0,1\}^n$, $y \neq x$, has the same $L$-profile.
     1. Reconstruction based on substring profiles: basic problems.
     2. What is the smallest length $L$ that allows (almost) all strings of length $n$ to be reconstructed uniquely?
     3. How many strings $y \neq x$ of length $n$, denoted $T_x$, have the same profile as $x$?
     4. What is the number $N_L$ of different profiles for given values of $n$ and $L$?

  9. Reconstructing Strings from Substring Profiles
     1. Problem introduced by Ukkonen, 1992, in the context of pattern matching. Important observation: a string is $L$-substring uniquely reconstructable if each $(L-1)$-substring occurs at most once.
     2. Margaritis and Skiena, 1995, showed that unique reconstruction properties depend on the period of the string (a string $x$ has period $p$ if $x_i = x_{i+p}$ for all $1 \le i \le n-p$). Unique $L$-substring reconstruction is impossible for strings with $p \le L$. Otherwise, $L \ge \lfloor n/2 \rfloor + 1$ suffices.
     3. Example: $S_4(0111011) = S_4(1110111) = \{0111, 1110, 1011, 1101\}$.
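
The example above can be checked numerically. The following is a minimal sketch, assuming the $L$-profile is taken as a plain set of substrings (as it is written in the example); the function names `substring_profile` and `smallest_period` are illustrative and do not come from the talk.

```python
def substring_profile(x: str, L: int) -> set:
    """Set of all length-L substrings of x (the L-profile, taken here as a set)."""
    return {x[i:i + L] for i in range(len(x) - L + 1)}


def smallest_period(x: str) -> int:
    """Smallest p such that x[i] == x[i + p] for every valid index i."""
    n = len(x)
    for p in range(1, n + 1):
        if all(x[i] == x[i + p] for i in range(n - p)):
            return p
    return n


a, b = "0111011", "1110111"
# Two different strings with the same 4-profile.
print(substring_profile(a, 4) == substring_profile(b, 4))
# Both strings have period 4, which does not exceed L = 4.
print(smallest_period(a), smallest_period(b))
```

Both strings have smallest period 4, so they fall in the regime $p \le L$ where unique reconstruction is ruled out.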

  12. Reconstructing Strings from Substring Profiles
      1. Information-theoretic approaches: Davisson, Longo, Sgarro, 1981, in the context of deriving error exponents for noiseless encoding of Markov sources, showed that the number of strings $T_x$ that share the $L = 2$ profile of $x$ satisfies $\frac{1}{n^2 (n+1)^4} \exp_2\big(nH(\Phi(x) \mid F(x))\big) \le T_x \le 2 \exp_2\big(nH(\Phi(x) \mid F(x))\big)$.
      2. Counting arguments: Szpankowski et al., 2015, derived bounds for $N_L$, the number of different $L$-profiles, when $L = 2$.
      3. Enumeration of lattice points in polytopes and Ehrhart theory: the general case $L \ge 2$ was analyzed by Kiah, Puleo, and Milenkovic, 2016, establishing that for constant $L$ one has $N_L \sim c(L)\, n^{2^L - 2^{L-1}}$.

  15. Main Results: Coding for Unique Reconstruction
      Let $C \subset \{0,1\}^n$. The code $C$ is $L$-reconstructable if every code string $c \in C$ is uniquely reconstructable based on $S_L(c)$. What is the maximal size of the code $C$ when $L$ is allowed to grow with $n$?
      1. Kiah et al., 2017, showed that codes of rate one exist provided that $L = 2\log n + \text{const}$. The code rate is zero for $L < \log n$, which follows from $N_L \le \binom{n-L+2^L}{2^L-1}$.
      2. Proofs are nonconstructive.

  18. Main Results: Coding for Unique Reconstruction
      Let $C \subset \{0,1\}^n$. The code $C$ is $L$-reconstructable if every code string $c \in C$ is uniquely reconstructable based on $S_L(c)$. What is the maximal size of the code $C$ when $L$ grows with $n$?
      1. Gabrys and Milenkovic, 2018, showed that codes of rate one exist provided that $L > \log n$. For $L = 2\log n + 2$, there exists an $L$-reconstructable code with one bit of redundancy. For $L = 2\log n + 5$, there exist codes with two bits of redundancy.
      2. For $\log n < L < 2\log n$, and all other parameter regimes, we have explicit (new) constructions based on repeat replacement.

  21. Lessons Learned
      1. If each substring of length $L-1$ in a code string is unique, then the code string is $L$-reconstructable.
      2. Example: Let $x = 01100101$ and $L = 4$, so that $S_4(x) = \{0110, 1100, 1001, 0010, 0101\}$. The string is 3-substring unique, as $S_3(x) = \{011, 110, 100, 001, 010, 101\}$. Choose a 4-substring that contains a 3-string appearing only once as a substring in the strings of $S_4(x)$ (e.g., 011). This is the initial substring. Then overlap suffix-to-prefix (as in de Bruijn graphs): $0110 \to 1100 \to 1001 \to \dots$
      3. Eliminate repeated substrings of length $L-1$. How does one transform an information string into a code string that does not have repeated $(L-1)$-substrings?
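
The suffix-to-prefix chaining in the example can be sketched in a few lines. This is an illustrative sketch only: it assumes $L \ge 2$, a set-valued profile, and that every $(L-1)$-substring of the unknown string is unique; the name `reconstruct_from_profile` is hypothetical. The start is picked as the $L$-substring whose $(L-1)$-prefix is not the suffix of any other profile element.

```python
def reconstruct_from_profile(profile: set) -> str:
    """Chain L-substrings by (L-1)-overlaps, assuming (L-1)-substring uniqueness."""
    L = len(next(iter(profile)))
    by_prefix = {w[:-1]: w for w in profile}   # (L-1)-prefix -> L-substring
    suffixes = {w[1:] for w in profile}        # all (L-1)-suffixes
    starts = [w for w in profile if w[:-1] not in suffixes]
    if len(by_prefix) != len(profile) or len(starts) != 1:
        raise ValueError("profile violates the (L-1)-uniqueness assumption")
    x = starts[0]
    for _ in range(len(profile) - 1):
        nxt = by_prefix.get(x[-(L - 1):])
        if nxt is None:
            raise ValueError("overlap chain is broken")
        x += nxt[-1]                           # extend by the one new symbol
    return x


print(reconstruct_from_profile({"0110", "1100", "1001", "0010", "0101"}))
```

On the profile from the example, the chain is 0110, 1100, 1001, 0010, 0101 and the output is the original string 01100101.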

  23. Why not Compression?
      1. Lempel-Ziv type of encoding?
      2. No guarantee that it will remove all repeats of length $L-1$, even when using "multiple passes."

  25. The Inspiration
      1. Wijngaarden and Immink, 2010, constructed maximum run-length limited codes based on a new technique called runlength replacement. We developed a new repeat replacement technique.
      2. The key idea is to remove "short" repeated strings and insert metadata that allows the original string to be retrieved.

  32. Runlength Replacement
      1. Let the information string be 00000001001110, and impose a maximum zero runlength constraint $r_{\max} = 5$. Focus on zero runs of length $\ge r_{\max} + 1$.
      2. Append the substring 1 to the information string. This leads to the string 000000010011101.
      3. Delete, starting from the left, zero runs of length exactly six. In the first step, this results in 010011101.
      4. Append $B(i)$ to the modified string, where $B(i) \in \{0,1\}^{\lceil \log n \rceil}$ denotes the binary representation of the location integer $i$; below, $\log n$ is shorthand for $\lceil \log n \rceil$.
      5. Since $i = 0$ and $\log n = 4$, we have $B(0) = 0^4$, which gives 0100111010000.
      6. Next, append 10 to arrive at the code string 010011101000010. The string 10 indicates that the substring immediately preceding it describes the location of a deleted zero run of length six.
      7. Typical repeat length $\sim \log n$, and $\log n$ bits are needed for $B(i)$. The code has one redundant bit for the correct parameter choice $n \le 2^{k-1} + k + 1$.
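
The worked example above can be traced with a short sketch of a single replacement step and its inverse. This is not the full encoder of Wijngaarden and Immink: it performs only one replacement, the address width `addr_bits` is passed in by hand, and the function names are hypothetical. The inverse relies on the observation from the example that a string ending in 0 carries a pointer, while a string ending in the appended 1 does not.

```python
def runlength_replace_once(info: str, r_max: int, addr_bits: int) -> str:
    """Append 1, delete the leftmost run of r_max + 1 zeros, append B(i) and 10."""
    s = info + "1"
    run = "0" * (r_max + 1)
    i = s.find(run)                      # location of the leftmost long zero run
    if i < 0:
        return s                         # nothing to replace
    s = s[:i] + s[i + len(run):]         # delete exactly r_max + 1 zeros
    return s + format(i, f"0{addr_bits}b") + "10"


def undo_once(code: str, r_max: int, addr_bits: int) -> str:
    """Invert a single step produced by runlength_replace_once."""
    if code.endswith("1"):               # no replacement took place
        return code[:-1]
    body = code[:-2]                     # strip the marker 10
    i = int(body[-addr_bits:], 2)        # read the pointer B(i)
    s = body[:-addr_bits]
    s = s[:i] + "0" * (r_max + 1) + s[i:]
    return s[:-1]                        # strip the appended 1


code = runlength_replace_once("00000001001110", r_max=5, addr_bits=4)
print(code)
print(undo_once(code, r_max=5, addr_bits=4))
```

With six deleted zeros and six appended bits (four pointer bits plus the marker 10), the length after the step equals the length of the information string plus the single appended 1.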

  39. What About Repeat Replacement?
      1. All input sequences $x \in \{0,1\}^{n-2}$ are required to satisfy $x_{L-1} = 1$ (see *).
      2. Let $L = 7$, $x = 10111110000010111110000110000000$; the first repeat of length seven, when scanned from the left, is 1011111.
      3. Repeat removal (naive approach): $1011111000001011111000010000000 \to 101111100000000010000000 \to 101111100000000010000000\,B(0)\,B(12)\,01$.
      4. Note that $\mathrm{len}(B(i)B(i')01) = 2\log n + 2$, and $L \sim 2\log n$.
      5. Unlike runlength removal, where shortening a runlength of zeros does not produce a new runlength, removing a repeat may produce a new repeat.
      6. (*) Can repeat the removal iteratively, but how do we guarantee that the procedure terminates?
      7. (*) Specialized procedure that ensures that at each round, the length of the input string is reduced by exactly one. Works if we remove $L-1$ and add $L-2 = 2\log n + 2$ bits back.
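
A single round of the naive repeat removal can be sketched as follows, applied to the 31-bit string written in the repeat-removal step. The helper names, the 0-based positions, and the choice of $\lceil \log_2 n \rceil$ address bits are assumptions made for illustration; the slide leaves $B(0)$ and $B(12)$ symbolic.

```python
def first_repeat(s: str, m: int):
    """Return (i, j), i < j, for the first length-m window that occurred earlier."""
    seen = {}
    for j in range(len(s) - m + 1):
        w = s[j:j + m]
        if w in seen:
            return seen[w], j
        seen[w] = j
    return None


def naive_repeat_removal(s: str, m: int) -> str:
    """Delete the second copy of the first repeat and append B(i) B(j) 01."""
    hit = first_repeat(s, m)
    if hit is None:
        return s
    i, j = hit
    bits = (len(s) - 1).bit_length()     # ceil(log2(len(s))) address bits
    trimmed = s[:j] + s[j + m:]
    return trimmed + format(i, f"0{bits}b") + format(j, f"0{bits}b") + "01"


s = "1011111000001011111000010000000"
print(first_repeat(s, 7))
print(naive_repeat_removal(s, 7))
```

For this toy input the output is longer than the input, since 7 removed bits are replaced by 12 bits of metadata; this is consistent with the remark that the repeat length has to be on the order of $2\log n$ for the replacement to pay off.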

  43. The Algorithm
      1. Repeat replacement encoder $E_{RR}$ for generating $(L-1)$-substring unique strings.
      2. If $x_I$ is $(L-1)$-substring unique, set $x = x_I$ and STOP. Otherwise, set $x^{(0)} = x_I$, and let $k = 1$.
      3. Suppose that $x^{(k-1)}$ has a repeat at positions $(i, j)$ of length $L-1$. Let $x^{(k)}$ be obtained by deleting $x^{(k-1)}_{j,L-1}$ from $x^{(k-1)}$ and appending $B(i)B(j)10$ to the end of the generated string.
      4. If $x^{(k)}_{L-1} = 0$, i.e., if the $(L-1)$-st bit of $x^{(k)}$ equals 0, reset $x^{(k)}_{L-1} = 1$ and update the last two bits of $x^{(k)}$ to 11. If $x^{(k)}$ is $(L-1)$-substring unique, set $x = x^{(k)}$ and STOP. Otherwise, set $k = k + 1$, and go to Step 3.
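
The steps above can be transcribed almost literally into code. The sketch below is an encoder-only illustration under stated assumptions: positions are 0-based, each pointer uses `b` bits with $L = 2b + 4$ (so that $L - 2 = 2b + 2$ bits are added back per round), and the input has at most $2^b$ bits so that every position fits in a pointer. It makes no claim about how decoding works, and the names `e_rr` and `first_repeat` are hypothetical.

```python
def first_repeat(s: str, m: int):
    """Return (i, j), i < j, for the first length-m window that occurred earlier."""
    seen = {}
    for j in range(len(s) - m + 1):
        w = s[j:j + m]
        if w in seen:
            return seen[w], j
        seen[w] = j
    return None


def e_rr(x_in: str, L: int, b: int) -> str:
    """Literal transcription of the listed steps (illustrative, encoder only)."""
    if L != 2 * b + 4:
        raise ValueError("sketch assumes L - 2 = 2b + 2")
    if len(x_in) > 2 ** b:
        raise ValueError("positions must fit in b bits")
    x = x_in
    while (hit := first_repeat(x, L - 1)) is not None:
        i, j = hit
        # Remove L - 1 bits, add back L - 2 bits: the length drops by one.
        x = x[:j] + x[j + L - 1:] + format(i, f"0{b}b") + format(j, f"0{b}b") + "10"
        if x[L - 2] == "0":              # the (L-1)-st bit, 1-based
            bits = list(x)
            bits[L - 2] = "1"
            bits[-2:] = ["1", "1"]
            x = "".join(bits)
    return x


out = e_rr("1010101010101010", L=12, b=4)
print(out, len(out))
```

Because every round shortens the string by exactly one bit, the loop in this sketch cannot run more rounds than the input has bits.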

  46. The Algorithm
      1. Encoder $E_{LR}$ for an $L$-reconstruction code.
      2. Let $x_I \in \{0,1\}^{n-2}$. If the $(L-1)$-th bit of $x_I$ is 1, append 10. Otherwise, set the $(L-1)$-th bit of $x_I$ to 1, and append 11.
      3. If $\ell(E_{RR}(x_I)) = n$, set $x = E_{RR}(x_I)$. Otherwise, append to $E_{RR}(x_I)$ as many zeros as needed to make the string $x$ have length $n$.

  50. Repeat Replacement for $\log n < L < 2\log n$
      1. Cannot afford encoding two positions for the two repeats.
      2. Can we use only the encoding of the position of the original (first) repeat?
      3. Place short markers, of length $\sim \log\log n$, at the positions of repeats. But how do we distinguish markers from information content and other markers?
      4. Perform runlength coding to remove runlengths exceeding length $\sim \log\log n$. Then use zero-runs of that length + 1 as markers.

  53. Additional Results
      1. Repeat replacement in the presence of erroneous and missing substrings.
      2. Interesting new technique for ensuring that code strings have substrings of length $\hat{L}$ at sufficiently large Hamming distance and of sufficiently large weight.
      3. For details, see: Kiah, Puleo, Milenkovic, IT 2016, DNA Profile Codes; Gabrys, Milenkovic, ISIT 2018; and the arXiv preprint Unique Reconstruction of Coded Strings from Multiset Substring Spectra at https://arxiv.org/abs/1804.04548

  54. Subsequences

  59. The $k$-Deck Problem
      1. The $k$-deck for $x \in \{0,1\}^n$ is the multiset of all subsequences of $x$ of length $k$.
      2. The $k$-deck problem asks one to determine the minimum value of $k$, denoted by $f(n)$, necessary to uniquely reconstruct any $x \in \{0,1\}^n$.
      3. Kalashnik, 1973, introduced the $k$-deck problem.
      4. Krasikov and Roditty, 1997, showed that $f(n) \le (1 + o(1))\,\frac{16}{7}\sqrt{n}$.
      5. Dudik and Schulman, 2003, proved that $f(n) \ge \exp(\Omega(\log^{1/2} n))$.
      6. Scott, 2011, proved that $f(n) \le (1 + o(1))\sqrt{n \log n}$.
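
The definition of the $k$-deck translates directly into a brute-force computation, which is enough to see why small $k$ can fail. The sketch below is illustrative only (the name `k_deck` is hypothetical, and the enumeration is exponential in $k$); it shows that 0110 and 1001 share the same 2-deck but differ in their 3-decks.

```python
from collections import Counter
from itertools import combinations


def k_deck(x: str, k: int) -> Counter:
    """Multiset of all length-k subsequences of x, one per choice of k positions."""
    return Counter("".join(sub) for sub in combinations(x, k))


print(k_deck("0110", 2) == k_deck("1001", 2))   # same 2-deck
print(k_deck("0110", 3) == k_deck("1001", 3))   # different 3-decks
```

So for strings of length 4, a deck with $k = 2$ does not suffice for unique reconstruction.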

  64. The Trace Reconstruction Problem
      1. One is given a collection of $m$ random subsequences (traces) of a string of length $n$, obtained by an i.i.d. deletion process with deletion probability $q$. When is unique reconstruction possible?
      2. Problem: determine achievable $(m, q)$ pairs for a given $n$.
      3. Introduced by Batu et al., 2004, who showed that:
         3.1 For random strings, $(m \sim \log n,\ q \sim \frac{1}{\log n})$ is achievable. Since improved to $(m = \exp(O(\log^{1/3} n)),\ q < \frac{1}{2})$ by Holden et al., 2018.
         3.2 For arbitrary strings, $(m = O(n \log n),\ q = \frac{1}{n^{1/2+\epsilon}})$, $\epsilon > 0$, is achievable. Since improved to $(m = \exp(O(n^{1/3})),\ q < \frac{1}{2})$ by De et al. and Nazarov and Peres, 2018.
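
To make the trace model concrete, the deletion process can be simulated in a few lines. This is a minimal sketch assuming each symbol is deleted independently with probability $q$; the function names and the fixed seed are illustrative choices, not part of the talk.

```python
import random


def make_trace(x: str, q: float, rng: random.Random) -> str:
    """Keep each symbol of x independently with probability 1 - q."""
    return "".join(sym for sym in x if rng.random() >= q)


def make_traces(x: str, m: int, q: float, seed: int = 0) -> list:
    """Generate m independent traces of x through the deletion channel."""
    rng = random.Random(seed)
    return [make_trace(x, q, rng) for _ in range(m)]


def is_subsequence(t: str, x: str) -> bool:
    """Check that t can be obtained from x by deleting symbols."""
    it = iter(x)
    return all(sym in it for sym in t)


for t in make_traces("0110100111", m=4, q=0.2):
    print(t, is_subsequence(t, "0110100111"))
```

Every trace is a subsequence of the input, and its expected length is $(1 - q)\,n$.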

  67. The Simplest Trace Reconstruction Algorithm
      1. Align the traces on the left. Find the majority bit value at the first position.
      2. Shift the traces that disagree with the majority to the right. Find the majority bit value at the second position, breaking ties arbitrarily.
      3. Iterate.
      [Figure: example binary traces, aligned on the left and then shifted after the first majority vote.]
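The three steps can be sketched as follows. This is one possible reading of the slide, in the spirit of what is often called bitwise majority alignment; the function name `majority_align`, the pointer bookkeeping, and the tie-breaking toward "0" are assumptions, not details given in the talk.

```python
from collections import Counter


def majority_align(traces, n):
    """Estimate a length-n binary string from traces by position-wise majority."""
    pointers = [0] * len(traces)  # index of the next unread symbol in each trace
    estimate = []
    for _ in range(n):
        # Symbols currently facing the vote; exhausted traces abstain.
        votes = Counter(t[p] for t, p in zip(traces, pointers) if p < len(t))
        if not votes:
            break  # every trace is used up
        # Ties go to "0" here; the slide allows any tie-breaking rule.
        bit = "1" if votes["1"] > votes["0"] else "0"
        estimate.append(bit)
        # Agreeing traces consume their symbol. Disagreeing traces keep it for
        # the next position, which is the "shift to the right" of step 2.
        for j, t in enumerate(traces):
            if pointers[j] < len(t) and t[pointers[j]] == bit:
                pointers[j] += 1
    return "".join(estimate)
```

For instance, `majority_align(["0110", "010", "0110"], 4)` returns `"0110"`: the second trace disagrees at the third position, is held back, and rejoins the vote at the fourth.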

  71. Coded k-Deck and Trace Reconstruction?
      1. Levenshtein, 1992, introduced an instance of a coded trace reconstruction problem.
      2. Empirical evidence that string balancing helps reduce the number of traces and the value of k needed for reconstruction (Yazdi, Gabrys, M, 2016); current collaboration with the Cheraghchi group, Imperial College.
      3. Sala and Dolecek, 2016, showed that if x is a codeword of an $(\ell - 1)$-insertion/deletion-correcting code, and each channel introduces at most $t \ge \ell$ errors, then one needs at most
         $$\sum_{j=\ell}^{t} \sum_{i=0}^{t-j} (-1)^{t+j-i} \binom{2j}{j} \binom{t+j-i}{2j} \binom{n+t}{i}$$
         traces for exact reconstruction.
      4. Nothing is known about coded k-deck reconstruction (yet)!
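To get a feel for how the code parameter ℓ changes the bound in item 3, the sum can be evaluated numerically. The sketch below assumes the bound reads as the double sum over j = ℓ, ..., t and i = 0, ..., t − j of (−1)^(t+j−i) C(2j, j) C(t+j−i, 2j) C(n+t, i); the function name `sala_dolecek_bound` is hypothetical.

```python
from math import comb


def sala_dolecek_bound(n, t, ell):
    """Evaluate the double sum, under the assumed reading stated above."""
    total = 0
    for j in range(ell, t + 1):
        for i in range(t - j + 1):
            sign = -1 if (t + j - i) % 2 else 1
            # i <= t - j guarantees t + j - i >= 2j, so every binomial is well defined.
            total += sign * comb(2 * j, j) * comb(t + j - i, 2 * j) * comb(n + t, i)
    return total
```

Under this reading, with n = 10 and t = 2 the value drops from 24 for ℓ = 1 (no coding) to 6 for ℓ = 2, which illustrates why coding reduces the number of traces needed.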

  73. A Practical Problem
      1. With nanopore sequencers, long traces are produced early during the sequencing process. After a certain point, the pores “tire” and only fairly short traces may be retrieved.
      2. How long should one run the sequencer so that unique reconstruction is possible?

  77. Mathematical Formulation
      We wish to reconstruct $x \in \{0,1\}^n$ given the following:
      1. A set $U \subseteq \{0,1\}^{n-t}$ of M subsequences of x obtained by deleting up to t bits, and
      2. The multiset of all length-k subsequences (the k-deck) of x.
      The set U represents the long traces, while the k-deck is the set of short traces.
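For very small n, the formulation can be explored by brute force. The sketch below computes the k-deck and lists every binary string consistent with both kinds of evidence; the helper names `k_deck`, `is_subsequence`, and `candidates` are illustrative, and exhaustive search over all 2^n strings is an assumption made for clarity, not a method from the talk.

```python
from collections import Counter
from itertools import combinations, product


def k_deck(x, k):
    """Multiset of all length-k subsequences of x, one per choice of k positions."""
    return Counter("".join(s) for s in combinations(x, k))


def is_subsequence(s, x):
    """Check that s can be obtained from x by deleting symbols."""
    it = iter(x)
    return all(c in it for c in s)


def candidates(n, long_traces, deck, k):
    """All y in {0,1}^n whose k-deck equals `deck` and which contain every long trace."""
    result = []
    for bits in product("01", repeat=n):
        y = "".join(bits)
        if k_deck(y, k) == deck and all(is_subsequence(u, y) for u in long_traces):
            result.append(y)
    return result
```

With x = "0110" and k = 2, the deck alone leaves two candidates, "0110" and "1001". Adding the single long trace "010" rules out "1001", so the two evidence sets together pin down x in this toy case.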
