Streaming for Aibohphobes: Longest Near-Palindrome under Hamming Distance


  1. Streaming for Aibohphobes: Longest Near-Palindrome under Hamming Distance ❖ Elena Grigorescu, Purdue University ❖ Erfan Sadeqi Azer, Indiana University ❖ Samson Zhou, Purdue University

  2. Structure of Talk ❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds

  3. Finding Structure in Noisy Data ❖ [Figure: a grid of letters in which structure is hidden among noise, including repeated occurrences of PATTERN, a periodic row PERIODPERIODPERIOD..., and a mirrored LONGPALINDROME]

  4. Palindrome ❖ A string that reads the same forwards and backwards ❖ $S = S^R$ ❖ RACECAR ❖ AIBOHPHOBIA

  5. $d$-Near-Palindrome ❖ A string that “almost” reads the same forwards and backwards ❖ Given a metric $\mathrm{dist}$, a $d$-near-palindrome has $\mathrm{dist}(S, S^R) \le d$. ❖ RACECAR ❖ FACECAR

  6. Hamming Distance ❖ Given strings $X, Y$ of equal length, the Hamming distance between $X$ and $Y$ is defined as the number of positions $i$ at which $X[i] \ne Y[i]$. ❖ $S$ = FACECAR ❖ $S^R$ = RACECAF ❖ $\mathrm{HAM}(S, S^R) = 2$
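
As a small illustration of the definition on this slide, the following Python sketch computes the Hamming distance and tests the $d$-near-palindrome condition. The function names `hamming` and `is_near_palindrome` are hypothetical and are not taken from the talk.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions i at which x[i] != y[i]; lengths must match."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def is_near_palindrome(s: str, d: int) -> bool:
    """True when HAM(s, reverse(s)) <= d."""
    return hamming(s, s[::-1]) <= d


if __name__ == "__main__":
    # FACECAR differs from its reverse RACECAF at the first and last position.
    print(hamming("FACECAR", "FACECAR"[::-1]))
```

Note that every mismatched pair of mirrored positions is counted twice, once from each side, which is why FACECAR is at distance 2 from its reverse rather than 1.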

  7. Streaming Model ❖ String of length $n$ arrives one symbol at a time ❖ Use $o(n)$ space, ideally $O(\mathrm{polylog}\, n)$ ❖ Example stream: abaacabaccbabbbcbabbccababbccb

  8. Longest $d$-Near-Palindrome Problem ❖ Exact version: given a string $S$ of length $n$, which arrives in a data stream, identify the longest $d$-near-palindrome in space $o(n)$. ❖ Approximate version: given a string $S$ of length $n$, which arrives in a data stream, find a “long” $d$-near-palindrome in space $o(n)$.
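
For reference, the problem can be solved offline by brute force when the whole string fits in memory. The sketch below is not the streaming algorithm from the talk: it uses linear space and quadratic time, and only pins down what "longest $d$-near-palindrome" means. The function name is hypothetical.

```python
def longest_near_palindrome(s: str, d: int) -> str:
    """Longest substring w of s with HAM(w, reverse(w)) <= d (offline baseline)."""
    n = len(s)
    best_lo, best_hi = 0, 0  # half-open window s[best_lo:best_hi]
    for center in range(2 * n - 1):
        # Even `center` sits on a letter, odd `center` sits between two letters.
        lo = center // 2
        hi = lo + center % 2
        mismatched_pairs = 0
        while lo >= 0 and hi < n:
            if s[lo] != s[hi]:
                # Each mismatched mirrored pair adds 2 to the Hamming distance.
                if 2 * (mismatched_pairs + 1) > d:
                    break
                mismatched_pairs += 1
            lo -= 1
            hi += 1
        # The loop stops one step past the valid window on each side.
        if hi - lo - 1 > best_hi - best_lo:
            best_lo, best_hi = lo + 1, hi
    return s[best_lo:best_hi]


if __name__ == "__main__":
    print(longest_near_palindrome("XFACECARY", 2))
```

Expanding outward from a fixed center can only add mismatches, so the first point at which the budget is exceeded bounds the longest window for that center.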

  9. Applications TGCTTAAGCGCTTGCAAGCGCTTAAGCA CAAGCGCTTAAGCA ACGAATTCGCGAAC

  10. Related Work (Palindromes in Data Streams) ❖ $O(\log n)$ space to provide a $(1+\varepsilon)$ multiplicative approximation to the length of the longest palindrome (Berenbrink, Ergün, Mallmann-Trenn, Sadeqi Azer ‘14) ❖ $O(\sqrt{n})$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest palindrome (BEMS14) ❖ $O(\sqrt{n})$ space to find the longest palindrome in two passes (BEMS14) ❖ $\Omega\left(\frac{\log n}{\varepsilon \log(1+\varepsilon)}\right)$ space for $(1+\varepsilon)$ multiplicative approximation (Gawrychowski, Merkurev, Shur, Uznanski ’16) ❖ $\Omega\left(\frac{n}{E}\right)$ space for $E$ additive approximation (GMSU16)

  11. Our Results ❖ $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ space to provide a $(1+\varepsilon)$ multiplicative approximation to the length of the longest $d$-near-palindrome ❖ $O(d \sqrt{n} \log^6 n)$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest $d$-near-palindrome ❖ $O(d^2 \sqrt{n} \log^6 n)$ space to find the longest $d$-near-palindrome in two passes ❖ $\Omega(d \log n)$ space lower bound for $(1+\varepsilon)$ multiplicative approximation ❖ $\Omega\left(\frac{dn}{E}\right)$ space lower bound for $E$ additive approximation

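  12. Comparison (Longest Palindrome vs. Longest $d$-Near-Palindrome, here) ❖ $(1+\varepsilon)$ multiplicative: $O(\log^2 n)$ (BEMS14) vs. $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ ❖ $\sqrt{n}$ additive: $O(\sqrt{n} \log n)$ (BEMS14) vs. $O(d \sqrt{n} \log^6 n)$ ❖ Two-pass exact: $O(\sqrt{n} \log n)$ (BEMS14) vs. $O(d^2 \sqrt{n} \log^6 n)$ ❖ $(1+\varepsilon)$ multiplicative lower bound: $\Omega\left(\frac{\log n}{\log(1+\varepsilon)}\right)$ (GMSU16) vs. $\Omega(d \log n)$ ❖ $E$ additive lower bound: $\Omega\left(\frac{n}{E}\right)$ (GMSU16) vs. $\Omega\left(\frac{dn}{E}\right)$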

  13. Structure of Talk ❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds

  14. Warm-Up ❖ Suppose we see string $S$, followed by string $T$. How can we determine if $S = T$, with high probability?

  15. Karp-Rabin Fingerprints ❖ Given base $B$ and a prime $P$, define $\phi(S) = \sum_{i=1}^{n} B^i S[i] \bmod P$ ❖ If $S = T$, then $\phi(S) = \phi(T)$ ❖ If $S \ne T$, then $\phi(S) \ne \phi(T)$ w.h.p. (Schwartz-Zippel)

  16. Properties of Karp-Rabin Fingerprints ❖ $\phi(S[1:y]) = \phi(S[1:x]) + B^x \phi(S[x+1:y])$ (concatenation) ❖ Define $\phi^R(S) = \sum_{i=1}^{n} B^{-i} S[i] \bmod P$ (reversal) ❖ $\phi\big((S[1:x])^R\big) = B^{x+1} \phi^R(S[1:x])$ ❖ $\phi^R(S[1:y]) = \phi^R(S[1:x]) + B^{-x} \phi^R(S[x+1:y])$ ❖ Can be computed on the fly
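
The properties above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the modulus $2^{61}-1$, the random choice of base, the encoding of symbols via `ord`, and the names `StreamFingerprint` and `phi` are all choices made here for the example.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed here as the modulus


def phi(s: str, B: int) -> int:
    """phi(S) = sum_{i=1..n} B^i * S[i] mod P, computed offline for checking."""
    return sum(pow(B, i, P) * ord(c) for i, c in enumerate(s, start=1)) % P


class StreamFingerprint:
    """Maintains phi and phi^R of the prefix read so far, one symbol at a time."""

    def __init__(self, base=None):
        self.B = base if base is not None else random.randrange(2, P - 1)
        self.B_inv = pow(self.B, P - 2, P)  # inverse of B mod P (Fermat)
        self.pow_B = 1      # B^x for the current prefix length x
        self.pow_B_inv = 1  # B^{-x}
        self.fwd = 0        # phi(S[1:x])
        self.rev = 0        # phi^R(S[1:x])

    def append(self, symbol: str) -> None:
        self.pow_B = self.pow_B * self.B % P
        self.pow_B_inv = self.pow_B_inv * self.B_inv % P
        c = ord(symbol)
        self.fwd = (self.fwd + self.pow_B * c) % P
        self.rev = (self.rev + self.pow_B_inv * c) % P

    def prefix_is_palindrome(self) -> bool:
        # Uses phi((S[1:x])^R) = B^{x+1} * phi^R(S[1:x]).
        # A True answer can be a false positive with small probability.
        reversed_fp = self.pow_B * self.B % P * self.rev % P
        return self.fwd == reversed_fp


if __name__ == "__main__":
    fp = StreamFingerprint()
    for ch in "RACECAR":
        fp.append(ch)
    print(fp.prefix_is_palindrome())
```

The state is a constant number of integers modulo $P$, independent of how many symbols have been read, which is what makes the fingerprints usable in the streaming model.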

  17. Identifying Palindromes ❖ 111101011100001010010101001111101011100001010010101001 ❖ 111101011100001010010101001111101011100001010010101001

  18. Identifying Near-Palindromes? ❖ 111101011100001010010101001111101011100001010010101001 ❖ 111101011100001010010101001111101011100001010010101001

  23. Identifying Near-Palindromes? (CFP+16)

  24. Karp-Rabin Fingerprints for Subpatterns ❖ $S_{a,b} = S[a]\, S[a+b]\, S[a+2b]\, S[a+3b] \ldots$ ❖ $\phi_{a,b}(S) = \phi(S_{a,b}) = B \cdot S[a] + B^2 \cdot S[a+b] + B^3 \cdot S[a+2b] + \ldots$

  25. Identifying Near-Palindromes? ❖ Let $\Delta = \#\{a \mid \phi_{a,b}(S) \ne B^k \phi^R_{a,b}(S)\}$ ❖ Then $\Delta \le \mathrm{HAM}(S, S^R)$
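
The bound $\Delta \le \mathrm{HAM}(S, S^R)$ can be checked on small inputs with the offline sketch below. It is a simplification, not the streaming procedure: instead of aligning $\phi$ with $B^k \phi^R$, it fingerprints the subpatterns of $S$ and of the explicitly reversed string. The modulus, the fixed base, and all function names are assumptions made for this example.

```python
P = (1 << 61) - 1  # assumed prime modulus


def phi(s: str, B: int) -> int:
    """Karp-Rabin fingerprint: sum_{i=1..n} B^i * S[i] mod P."""
    return sum(pow(B, i, P) * ord(c) for i, c in enumerate(s, start=1)) % P


def subpattern(s: str, a: int, b: int) -> str:
    """S_{a,b} = S[a] S[a+b] S[a+2b] ... with 1-based offset a in 1..b."""
    return s[a - 1::b]


def hamming(x: str, y: str) -> int:
    return sum(p != q for p, q in zip(x, y))


def delta(s: str, b: int, B: int = 1_000_003) -> int:
    """Number of offsets a whose subpattern fingerprints differ between S and S^R."""
    r = s[::-1]
    return sum(
        phi(subpattern(s, a, b), B) != phi(subpattern(r, a, b), B)
        for a in range(1, b + 1)
    )


if __name__ == "__main__":
    s = "FACECAR"
    for b in (2, 3, 5):
        print(b, delta(s, b), hamming(s, s[::-1]))
```

Each mismatch position falls into exactly one subpattern, so the count of differing subpatterns never exceeds the number of mismatches. For FACECAR the two mismatches (positions 1 and 7) share a subpattern when $b = 3$ and are separated when $b = 5$, which is why the choice of $b$ matters.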

  26. Identifying Near-Palindromes? ❖ Sample $\log n$ primes $p_1, p_2, \ldots$ from $[16 d \log^2 n,\ 544 d \log^2 n]$. ❖ Let $\Delta = \max_i \#\{a \mid \phi_{a,p_i}(S) \ne B^k \phi^R_{a,p_i}(S)\}$ ❖ $\Delta \le \mathrm{HAM}(S, S^R)$ ❖ If $\mathrm{HAM}(S, S^R) > 2d$, then $\Delta > \left(1 + \frac{1}{16}\right) d$ w.h.p. (CFP+16) ❖ What about $\mathrm{HAM}(S, S^R) \le 2d$?

  27. Karp-Rabin Fingerprints for Sub-Subpatterns

  28. Second-Level Karp-Rabin Fingerprints ❖ Call a mismatch isolated under $p_i$ if it is the only mismatch under some subpattern $S_{a,p_i}$. Let $I$ be the number of isolated mismatches. ❖ If $\mathrm{HAM}(S, S^R) \le 2d$, then $I = \mathrm{HAM}(S, S^R)$ w.h.p. (CFP+16)

  29. In Review ❖ There exists a data structure of size $O(d \log^6 n)$ bits that recognizes whether $\mathrm{HAM}(S, S^R) \le d$ w.h.p. ❖ Recently, this has been improved to $O(d \log n)$. (Clifford, Kociumaka, Porat ‘17) ❖ Through black-box reduction, this improves our results by a factor of $O(\log^5 n)$.

  30. Additive Error Algorithm ❖ Initialize a data structure every $\frac{\sqrt{n}}{2}$ positions!

  31. Additive Error Algorithm ❖ $2\sqrt{n}$ sketches, each of size $O(d \log^6 n)$ bits ❖ Total space: $O(d \sqrt{n} \log^6 n)$ bits
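
The space bound on this slide follows from a short count, written out here as a worked calculation. The second line plugs in the $O(d \log n)$ sketch size quoted from Clifford, Kociumaka, and Porat on the review slide; it is a direct substitution rather than a result stated on this slide.

```latex
\[
\frac{n}{\sqrt{n}/2} = 2\sqrt{n} \ \text{sketches},
\qquad
2\sqrt{n} \cdot O(d \log^6 n) = O(d \sqrt{n} \log^6 n) \ \text{bits}.
\]
\[
\text{With sketches of size } O(d \log n): \qquad
2\sqrt{n} \cdot O(d \log n) = O(d \sqrt{n} \log n) \ \text{bits}.
\]
```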

  32. Structure of Talk ❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds

  33. 2-Pass Exact Algorithm ❖ Can we modify 1-pass additive algorithm to 2-pass exact? ❖ Missing characters before checkpoint!

  34. 2-Pass Exact Algorithm ❖ Idea: keep all characters before each checkpoint in the second pass ❖ What if there are $\Omega(n)$ candidates? ❖ Structural result of palindromes (BEMS14)

  35. Structural Result of Near-Palindromes ❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes ❖ Goal #2: Use sublinear space in compression

  42. Structural Result of Near-Palindromes ❖ Not quite periodic (at most $2d - 1$ different words) ❖ Need to save at most $2d - 1$ fingerprints of words

  43. 2-Pass Exact Algorithm ❖ First pass: $O(d^2 \sqrt{n} \log^6 n)$ bits ❖ At most $2d - 1$ fingerprints, each of size $O(d \log^6 n)$ words ❖ Need to save $\sqrt{n}$ characters before $2d - 1$ checkpoints: $O(d \sqrt{n})$ ❖ Total space: $O(d^2 \sqrt{n} \log^6 n)$ bits

  44. Structure of Talk ❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds

  45. Multiplicative Lower Bounds ❖ Yao’s Principle: find “hard” distribution for deterministic algorithms ❖ Let $\nu$ be the prefix of $10110011100011110000\ldots = 1^1 0^1 1^2 0^2 \ldots$ of length $\frac{n}{4}$ (GMSU16). ❖ Take $x \in X$ = strings of length $\frac{n}{4}$ with weight $d$ ❖ Take $y \in Y = \{\, y \mid \mathrm{HAM}(x, y) = d \text{ or } \mathrm{HAM}(x, y) = d + 1 \,\}$ ❖ Define $s(x, y) = \nu^R x y^R \nu$.
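
The construction on this slide can be made concrete with a short Python sketch. It only builds the strings; it does not reproduce the lower-bound argument. The function names are hypothetical, and the choice to pad with a prefix of $\nu$ exactly as long as $x$ mirrors the slide's use of length $\frac{n}{4}$ for both.

```python
def nu(length: int) -> str:
    """Prefix of 1^1 0^1 1^2 0^2 1^3 0^3 ... of the given length."""
    blocks, total, k = [], 0, 1
    while total < length:
        blocks.append("1" * k + "0" * k)
        total += 2 * k
        k += 1
    return "".join(blocks)[:length]


def hamming(x: str, y: str) -> int:
    return sum(p != q for p, q in zip(x, y))


def hard_instance(x: str, y: str) -> str:
    """s(x, y) = nu^R x y^R nu, with |nu| = |x| = |y| = n/4."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    v = nu(len(x))
    return v[::-1] + x + y[::-1] + v


if __name__ == "__main__":
    print(nu(20))
    s = hard_instance("1100", "1010")
    print(s, hamming(s, s[::-1]))
```

Because the reverse of $\nu^R x y^R \nu$ is $\nu^R y x^R \nu$, the outer padding always matches itself and $\mathrm{HAM}(s, s^R) = 2 \cdot \mathrm{HAM}(x, y)$, so the distance of the whole string to its reverse is decided entirely by how far $x$ is from $y$.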
