SLIDE 1 Streaming for Aibohphobes: Longest Near-Palindrome under Hamming Distance
Elena Grigorescu, Purdue University Erfan Sadeqi Azer, Indiana University Samson Zhou, Purdue University
SLIDE 2
Structure of Talk
❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds
SLIDE 3
Finding Structure in Noisy Data
FSTTCSIITKANPURPATTERNINDIAP
ALPPATTERNSZXAITIIKIICU JBFQWA
FSTTCSPATTERNIITKANPURINDIAO
STREAMINGALGORITHMPATTERNU
PERIODPERIODPERIODPERIODPER
FSTTCSTHEORYCSASBRICBCAUON
LONGPALINDROMEEMORDNILAPGN
OLFSTTCSIITKANPURINDIAGENXAS
SLIDE 4
Palindrome
❖ A string that reads the same forwards and backwards
❖ $S = S^R$, where $S^R$ is the reverse of $S$
❖ RACECAR
❖ AIBOHPHOBIA
SLIDE 5
$d$-Near-Palindrome
❖ A string that “almost” reads the same forwards and backwards
❖ Given a metric $\mathrm{dist}$, a $d$-near-palindrome has $\mathrm{dist}(S, S^R) \le d$.
❖ RACECAR
❖ FACECAR
SLIDE 6 Hamming Distance
❖ Given strings $X, Y$ of equal length, the Hamming distance between $X$ and $Y$ is defined as the number of positions $i$ at which $X_i \ne Y_i$.
❖ $S$ = FACECAR
❖ $S^R$ = RACECAF
❖ $\mathrm{HAM}(S, S^R) = 2$
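The definitions on this slide can be checked directly. The following is a minimal offline sketch (not a streaming algorithm); the function names `hamming` and `is_near_palindrome` are illustrative and do not come from the talk.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which the equal-length strings x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def is_near_palindrome(s: str, d: int) -> bool:
    """True if s is within Hamming distance d of its own reverse."""
    return hamming(s, s[::-1]) <= d


print(hamming("FACECAR", "RACECAF"))      # 2
print(is_near_palindrome("FACECAR", 2))   # True
print(is_near_palindrome("FACECAR", 1))   # False
```

Note that a single mismatched symmetric pair (here F and R) contributes 2 to the distance between a string and its reverse.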
SLIDE 7
Streaming Model
❖ String of length $n$ arrives one symbol at a time
❖ Use $o(n)$ space, ideally $O(\operatorname{polylog} n)$
abaacabaccbabbbcbabbccababbccb
SLIDE 8
Longest $d$-Near-Palindrome Problem
❖ Given a string $S$ of length $n$, which arrives in a data stream, identify the longest $d$-near-palindrome in space $o(n)$.
❖ Given a string $S$ of length $n$, which arrives in a data stream, find a “long” $d$-near-palindrome in space $o(n)$.
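For intuition about what the streaming algorithms must approximate, here is a brute-force offline baseline. It is only a sketch: it stores the whole string and takes quadratic time, so it is not one of the talk's algorithms. The name `longest_near_palindrome` is hypothetical.

```python
def longest_near_palindrome(s: str, d: int) -> str:
    """Longest substring t of s with HAM(t, reverse(t)) <= d (offline, O(n^2) time).

    Each mismatched symmetric pair counts twice in HAM(t, reverse(t)),
    so a substring may contain at most d // 2 mismatched pairs.
    """
    n = len(s)
    best_lo, best_hi = 0, 0                 # half-open range of the best substring
    for center in range(2 * n - 1):         # n odd-length and n - 1 even-length centers
        lo, hi = center // 2, (center + 1) // 2
        pairs = 0                           # mismatched symmetric pairs so far
        while lo >= 0 and hi < n:
            if s[lo] != s[hi]:
                pairs += 1
                if 2 * pairs > d:           # budget exceeded, stop expanding
                    break
            if hi - lo + 1 > best_hi - best_lo:
                best_lo, best_hi = lo, hi + 1
            lo -= 1
            hi += 1
    return s[best_lo:best_hi]


print(longest_near_palindrome("XFACECARY", 2))  # FACECAR
print(longest_near_palindrome("XFACECARY", 0))  # ACECA
```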
SLIDE 9
Applications
TGCTTAAGCGCTTGCAAGCGCTTAAGCA CAAGCGCTTAAGCA ACGAATTCGCGAAC
SLIDE 10 Related Work (Palindromes in Data Streams)
❖ $O(\log n)$ space to provide a $1+\varepsilon$ multiplicative approximation to the length of the longest palindrome (Berenbrink, Ergün, Mallmann-Trenn, Sadeqi Azer ‘14)
❖ $O(\sqrt{n})$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest palindrome (BEMS14)
❖ $O(\sqrt{n})$ space to find the longest palindrome in two passes (BEMS14)
❖ $\Omega\left(\frac{\log n}{\varepsilon \log(1+\varepsilon)}\right)$ space for $1+\varepsilon$ multiplicative approximation (Gawrychowski, Merkurev, Shur, Uznanski ’16)
❖ $\Omega\left(\frac{n}{E}\right)$ space for $E$ additive approximation (GMSU16)
SLIDE 11 Our Results
❖ $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ space to provide a $1+\varepsilon$ multiplicative approximation to the length of the longest $d$-near-palindrome
❖ $O(d\sqrt{n}\log^6 n)$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest $d$-near-palindrome
❖ $O(d^2\sqrt{n}\log^6 n)$ space to find the longest $d$-near-palindrome in two passes
❖ $\Omega(d \log n)$ space LB for $1+\varepsilon$ multiplicative approximation
❖ $\Omega\left(\frac{dn}{E}\right)$ space LB for $E$ additive approximation
SLIDE 12 Comparison
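| | Longest Palindrome | Longest $d$-Near-Palindrome (Here) |
|---|---|---|
| $1+\varepsilon$ multiplicative | $O(\log^2 n)$ (BEMS14) | $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ |
| $\sqrt{n}$ additive | $O(\sqrt{n}\log n)$ (BEMS14) | $O(d\sqrt{n}\log^6 n)$ |
| two pass exact | $O(\sqrt{n}\log n)$ (BEMS14) | $O(d^2\sqrt{n}\log^6 n)$ |
| $1+\varepsilon$ multiplicative LB | $\Omega\left(\frac{\log n}{\log(1+\varepsilon)}\right)$ (GMSU16) | $\Omega(d \log n)$ |
| $E$ additive LB | $\Omega\left(\frac{n}{E}\right)$ (GMSU16) | $\Omega\left(\frac{dn}{E}\right)$ |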
SLIDE 13
Structure of Talk
❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds
SLIDE 14
Warm-Up
❖ Suppose we see string $S$, followed by string $T$. How can we determine if $S = T$, with high probability?
SLIDE 15 Karp-Rabin Fingerprints
❖ Given base $B$ and a prime $P$, define $\phi(S) = \sum_{i=1}^{n} B^i S[i] \bmod P$
❖ If $S = T$, then $\phi(S) = \phi(T)$
❖ If $S \ne T$, then $\phi(S) \ne \phi(T)$ w.h.p. (Schwartz-Zippel)
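A minimal sketch of the fingerprint defined above, computed one symbol at a time. The choice of the Mersenne prime $2^{61}-1$ for $P$, the use of `ord` to map symbols to integers, and the helper name `make_fingerprint` are assumptions made for illustration, not details from the talk.

```python
import random

P = (1 << 61) - 1  # assumed modulus: a large prime


def make_fingerprint(seed=None):
    """Return phi(S) = sum_{i=1..n} B^i * S[i] mod P for a randomly drawn base B."""
    rng = random.Random(seed)
    base = rng.randrange(2, P - 1)

    def phi(s: str) -> int:
        total, power = 0, 1
        for ch in s:                      # symbols arrive one at a time
            power = power * base % P      # power = B^i
            total = (total + power * ord(ch)) % P
        return total

    return phi


phi = make_fingerprint(seed=1)
print(phi("STREAM") == phi("STREAM"))  # True: equal strings always agree
print(phi("STREAM") == phi("STREAK"))  # False: a single differing symbol always changes phi
```

Only `total` and `power` are kept between symbols, which is what makes the fingerprint usable in a stream.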
SLIDE 16 Properties of Karp-Rabin Fingerprints
❖ $\phi(S[1:y]) = \phi(S[1:x]) + B^x \phi(S[x:y])$ (concatenation)
❖ Define $\phi^R(S) = \sum_{i=1}^{n} B^{-i} S[i] \bmod P$ (reversal)
❖ $\phi(S^R[1:x]) = B^{x+1} \phi^R(S[1:x])$
❖ $\phi^R(S[1:y]) = \phi^R(S[1:x]) + B^{-x} \phi^R(S[x:y])$
❖ Can be computed on the fly
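The concatenation and reversal identities combine into a palindrome test on any substring. The sketch below keeps every prefix fingerprint in a list for clarity, which uses linear space; a streaming algorithm would keep only selected prefixes. The constants `B` and `P` and the function names are assumptions, and `pow(B, -1, P)` needs Python 3.8 or later.

```python
P = (1 << 61) - 1          # assumed prime modulus
B = 1_000_003              # fixed base for reproducibility; normally drawn at random
B_INV = pow(B, -1, P)      # modular inverse of B


def prefix_fingerprints(s: str):
    """fwd[x] = phi(S[1:x]) and rev[x] = phi^R(S[1:x]), built in one left-to-right pass."""
    fwd, rev = [0], [0]
    bp, bip = 1, 1
    for ch in s:
        bp = bp * B % P            # B^i
        bip = bip * B_INV % P      # B^(-i)
        fwd.append((fwd[-1] + bp * ord(ch)) % P)
        rev.append((rev[-1] + bip * ord(ch)) % P)
    return fwd, rev


def looks_like_palindrome(fwd, rev, x: int, y: int) -> bool:
    """Fingerprint test for the substring at 1-indexed positions x+1..y."""
    length = y - x
    f = (fwd[y] - fwd[x]) * pow(B_INV, x, P) % P   # shift positions down by x
    r = (rev[y] - rev[x]) * pow(B, x, P) % P
    return f == pow(B, length + 1, P) * r % P       # phi(T^R) = B^(|T|+1) * phi^R(T)


fwd, rev = prefix_fingerprints("XRACECARY")
print(looks_like_palindrome(fwd, rev, 1, 8))  # True: RACECAR
print(looks_like_palindrome(fwd, rev, 0, 9))  # False: XRACECARY
```

A true palindrome always passes; a non-palindrome can pass only through a fingerprint collision.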
SLIDE 17
Identifying Palindromes
❖ 111101011100001010010101001111101011100001010010101001 ❖ 111101011100001010010101001111101011100001010010101001
SLIDE 18
Identifying Near-Palindromes?
❖ 111101011100001010010101001111101011100001010010101001 ❖ 111101011100001010010101001111101011100001010010101001
SLIDE 19
Identifying Near-Palindromes?
❖ 111101011100001010010101001111101011100001010010101001 ❖ 111101011100001010010101001111101011100001010010101001
SLIDE 20
Identifying Near-Palindromes?
❖ 111101011100001010010101001111101011100001010010101001 ❖ 111101011100001010010101001111101011100001010010101001
SLIDE 21
Identifying Near-Palindromes?
❖ 111101011100001010010101001111101011100001010010101001 ❖ 111101011100001010010101001111101011100001010010101001
SLIDE 22
Identifying Near-Palindromes?
❖ 111101011100001010010101001111101011100001010010101001 ❖ 111101011100001010010101001111101011100001010010101001
SLIDE 23
Identifying Near-Palindromes? (CFP+16)
SLIDE 24
Karp-Rabin Fingerprints for Subpatterns
❖ $S_{a,b} = S[a]\, S[a+b]\, S[a+2b]\, S[a+3b] \ldots$
❖ $\phi_{a,b}(S) = \phi(S_{a,b}) = B \cdot S[a] + B^2 \cdot S[a+b] + B^3 \cdot S[a+2b] + \ldots$
SLIDE 25 Identifying Near-Palindromes?
❖ Let $\Delta = \#\{a \mid \phi_{a,b}(S) \ne B^k \phi^R_{a,b}(S)\}$
❖ Then $\Delta \le \mathrm{HAM}(S, S^R)$
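A small sketch of why the count of mismatching subpatterns never exceeds the Hamming distance: every mismatched position falls into exactly one residue class modulo $b$. For simplicity this version fingerprints the subpatterns of the reversed string directly instead of using the $B^k \phi^R$ identity from the slide, and it works offline; `B`, `P`, and the function names are illustrative assumptions.

```python
P = (1 << 61) - 1   # assumed prime modulus
B = 1_000_003       # assumed fixed base


def phi(seq: str) -> int:
    """Karp-Rabin fingerprint sum_i B^i * seq[i] mod P (positions counted from 1)."""
    total, power = 0, 1
    for ch in seq:
        power = power * B % P
        total = (total + power * ord(ch)) % P
    return total


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def mismatched_subpatterns(s: str, b: int) -> int:
    """Delta: number of offsets a whose subpattern differs between s and its reverse."""
    t = s[::-1]
    return sum(phi(s[a::b]) != phi(t[a::b]) for a in range(b))


s = "FACECAR"
print(hamming(s, s[::-1]))            # 2
print(mismatched_subpatterns(s, 3))   # 1: both mismatches land in the same subpattern
print(mismatched_subpatterns(s, 5))   # 2: the mismatches are separated
```

With step 3 the two mismatches collide, so $\Delta$ undercounts; this is the reason the talk moves on to sampling several primes.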
SLIDE 26 Identifying Near-Palindromes?
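❖ Sample $\log n$ primes $p_1, p_2, \ldots$ from $[16\, d \log^2 n,\ 544\, d \log^2 n]$.
❖ Let $\Delta = \max_i \#\{a \mid \phi_{a,p_i}(S) \ne B^k \phi^R_{a,p_i}(S)\}$
❖ $\Delta \le \mathrm{HAM}(S, S^R)$
❖ If $\mathrm{HAM}(S, S^R) > 2d$, then $\Delta > \left(1 + \frac{1}{16}\right) d$ w.h.p. (CFP+16)
What about $\mathrm{HAM}(S, S^R) \le 2d$?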
SLIDE 27
Karp-Rabin Fingerprints for Sub-Subpatterns
SLIDE 28
Second-Level Karp-Rabin Fingerprints
❖ Call a mismatch isolated under $p_i$ if it is the only mismatch under some subpattern $S_{a,p_i}$. Let $I$ be the number of isolated mismatches.
❖ If $\mathrm{HAM}(S, S^R) \le 2d$, then $I = \mathrm{HAM}(S, S^R)$ w.h.p. (CFP+16)
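To make the notion of an isolated mismatch concrete, the sketch below computes it offline by comparing the string with its reverse position by position. The real data structure only sees fingerprints; the function name `isolated_mismatches` and the tiny primes used in the example are illustrative assumptions, not the parameters from the talk.

```python
from collections import Counter


def isolated_mismatches(s: str, primes):
    """Return (isolated, mismatches) as sorted lists of 0-indexed positions.

    A mismatch at position i is isolated under p if no other mismatch
    shares its residue class i mod p, i.e. its subpattern.
    """
    t = s[::-1]
    mismatches = [i for i in range(len(s)) if s[i] != t[i]]
    isolated = set()
    for p in primes:
        per_class = Counter(i % p for i in mismatches)
        isolated.update(i for i in mismatches if per_class[i % p] == 1)
    return sorted(isolated), mismatches


print(isolated_mismatches("FACECAR", [3]))     # ([], [0, 6])
print(isolated_mismatches("FACECAR", [3, 5]))  # ([0, 6], [0, 6])
```

Under the single prime 3 both mismatches share a subpattern and neither is isolated; adding the prime 5 isolates both, so the isolated count matches the Hamming distance.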
SLIDE 29
In Review
❖ There exists a data structure of size $O(d \log^6 n)$ bits that recognizes whether $\mathrm{HAM}(S, S^R) \le d$ w.h.p.
❖ Recently, this has been improved to $O(d \log n)$. (Clifford, Kociumaka, Porat ‘17)
❖ Through a black-box reduction, this improves our results by an $O(\log^5 n)$ factor.
SLIDE 30 Additive Error Algorithm
❖ Initialize a data structure every $\frac{\sqrt{n}}{2}$ positions!
SLIDE 31
Additive Error Algorithm
❖ $2\sqrt{n}$ sketches, each of size $O(d \log^6 n)$ bits
❖ Total space: $O(d\sqrt{n}\log^6 n)$ bits
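The space bound is a one-line product. Spelled out, assuming one sketch is started every $\sqrt{n}/2$ positions as stated on the previous slide:

```latex
\[
\underbrace{\frac{n}{\sqrt{n}/2}}_{\text{number of sketches}} = 2\sqrt{n},
\qquad
2\sqrt{n} \cdot O\!\left(d \log^{6} n\right) = O\!\left(d \sqrt{n}\, \log^{6} n\right) \text{ bits}.
\]
```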
SLIDE 32
Structure of Talk
❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds
SLIDE 33
2-Pass Exact Algorithm
❖ Can we modify the 1-pass additive algorithm into a 2-pass exact algorithm?
❖ Missing characters before checkpoint!
SLIDE 34
2-Pass Exact Algorithm
❖ Idea: keep all characters before each checkpoint in the second pass
❖ What if there are $\Omega(n)$ candidates?
❖ Structural result of palindromes (BEMS14)
SLIDE 35
Structural Result of Near-Palindromes
❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes
❖ Goal #2: Use sublinear space in compression
SLIDE 36
Structural Result of Near-Palindromes
❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes
❖ Goal #2: Use sublinear space in compression
SLIDE 37
Structural Result of Near-Palindromes
❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes
❖ Goal #2: Use sublinear space in compression
SLIDE 38
Structural Result of Near-Palindromes
❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes
❖ Goal #2: Use sublinear space in compression
SLIDE 39
Structural Result of Near-Palindromes
❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes
❖ Goal #2: Use sublinear space in compression
SLIDE 40
Structural Result of Near-Palindromes
SLIDE 41
Structural Result of Near-Palindromes
❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes
❖ Goal #2: Use sublinear space in compression
SLIDE 42
Structural Result of Near-Palindromes
❖ Not quite periodic (at most $2d - 1$ different words)
❖ Need to save at most $2d - 1$ fingerprints of words
SLIDE 43
2-Pass Exact Algorithm
❖ First pass: $O(d^2\sqrt{n}\log^6 n)$ bits
❖ At most $2d - 1$ fingerprints, each of size $O(d \log^6 n)$ words
❖ Need to save at most $\sqrt{n}$ characters before $2d - 1$ checkpoints: $O(d\sqrt{n})$
❖ Total space: $O(d^2\sqrt{n}\log^6 n)$ bits
SLIDE 44
Structure of Talk
❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds
SLIDE 45 Multiplicative Lower Bounds
❖ Yao’s Principle: find “hard” distribution for deterministic algorithms
❖ Let $\nu$ be the prefix of length $\frac{n}{4}$ of $10110011100011110000\ldots = 1^1 0^1 1^2 0^2 \ldots$ (GMSU16).
❖ Take $x \in X = \{\text{strings of length } \frac{n}{4} \text{ with weight } d\}$
❖ Take $y \in Y = \{y \mid \mathrm{HAM}(x, y) = d \text{ or } \mathrm{HAM}(x, y) = d + 1\}$
❖ Define $s(x, y) = \nu^R x\, y^R \nu$.
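The hard instance can be built explicitly. In this sketch, `nu` generates the prefix of $1^1 0^1 1^2 0^2 \ldots$ and `hard_instance` assembles $s(x, y) = \nu^R x\, y^R \nu$; both function names and the small example inputs are illustrative assumptions.

```python
def nu(length: int) -> str:
    """Prefix of the given length of 1^1 0^1 1^2 0^2 1^3 0^3 ..."""
    out, k = [], 1
    while sum(len(block) for block in out) < length:
        out.append("1" * k + "0" * k)
        k += 1
    return "".join(out)[:length]


def hard_instance(x: str, y: str) -> str:
    """s(x, y) = reverse(nu) + x + reverse(y) + nu, with |nu| = |x| = |y| = n/4."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    v = nu(len(x))
    return v[::-1] + x + y[::-1] + v


def hamming(a: str, b: str) -> int:
    return sum(p != q for p, q in zip(a, b))


print(nu(20))                         # 10110011100011110000
s = hard_instance("10010000", "10010011")
print(len(s))                         # 32
print(hamming(s, s[::-1]))            # 4
```

The outer copies of $\nu$ mirror each other exactly, so mismatches between $x$ and $y$ are the only source of mismatches between $s(x, y)$ and its reverse.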
SLIDE 46 Multiplicative Lower Bounds
YES: If $\mathrm{HAM}(x, y) \le d$, then the longest $d$-near-palindrome of $s(x, y)$ has length $n$.
NO: If $\mathrm{HAM}(x, y) > d$, then the longest $d$-near-palindrome of $s(x, y)$ has length at most $200d^2 + \frac{n}{2}$.
SLIDE 47
Multiplicative Lower Bounds
❖ A $1+\varepsilon$ multiplicative algorithm differentiates whether $\mathrm{HAM}(x, y) \le d$ or $\mathrm{HAM}(x, y) > d$.
❖ Just need to show that no algorithm can differentiate whether $\mathrm{HAM}(x, y) \le d$ or $\mathrm{HAM}(x, y) > d$ in $o(d \log n)$ space!
SLIDE 48 Multiplicative Lower Bounds
❖ Save $x$ in $\frac{d \log n}{3}$ bits.
❖ Since $x \in X = \{\text{strings of length } \frac{n}{4} \text{ with weight } d\}$, there are $\frac{|X|}{4}$ pairs $(x, x')$ which are mapped to the same configuration.
SLIDE 49 Multiplicative Lower Bounds
❖ Let $I$ be the set of indices for which $x_i = 1$ or $x'_i = 1$
❖ Suppose $\mathrm{HAM}(x, y) = d$ but $y$ does not differ from $x$ in $I$
❖ $x$: 101100000010001000000100100000
❖ $x'$: 100000010010101000000100100000
❖ $y$: 111101100010001011100100100010
❖ Then $\mathrm{HAM}(x', y) > d$!
❖ Errs on either $s(x, y)$ or $s(x', y)$.
???
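The three example strings on this slide can be checked mechanically. The sketch below uses 0-indexed positions and takes $d$ to be the weight of $x$, which is an assumption read off from the definition of $X$ earlier in the talk.

```python
x       = "101100000010001000000100100000"
x_prime = "100000010010101000000100100000"
y       = "111101100010001011100100100010"


def hamming(a: str, b: str) -> int:
    return sum(p != q for p, q in zip(a, b))


d = x.count("1")                                   # weight of x
index_set = {i for i in range(len(x)) if x[i] == "1" or x_prime[i] == "1"}
diff_xy = {i for i in range(len(x)) if x[i] != y[i]}

print(d, x_prime.count("1"))          # 7 7: both strings have weight d
print(hamming(x, y))                  # 7: equal to d
print(diff_xy.isdisjoint(index_set))  # True: y agrees with x on every index in I
print(hamming(x_prime, y))            # 11: strictly greater than d
```

Because $y$ agrees with $x$ on $I$, every position where $x$ and $x'$ differ adds to the distance, giving $\mathrm{HAM}(x', y) = \mathrm{HAM}(x, y) + \mathrm{HAM}(x, x') = 7 + 4$.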
SLIDE 50 Multiplicative Lower Bounds
❖ There are $\frac{|X|}{4}$ values of $x$ mapped to the wrong configuration, each with $\binom{\frac{n}{4} - 2d}{d}$ values of $y$, where the algorithm is incorrect.
❖ Probability of failure: $\frac{\frac{|X|}{4} \binom{\frac{n}{4} - 2d}{d}}{|X|\,|Y|} \ge \frac{1}{n}$
SLIDE 51 In Review
❖ Provided a distribution over which any deterministic algorithm with $o(d \log n)$ bits fails to distinguish $\mathrm{HAM}(x, y) \le d$ or $\mathrm{HAM}(x, y) > d$ at least $\frac{1}{n}$ of the time
❖ A $1+\varepsilon$ multiplicative algorithm differentiates whether $\mathrm{HAM}(x, y) \le d$ or $\mathrm{HAM}(x, y) > d$
❖ Showed every deterministic algorithm fails over random inputs
SLIDE 52 Additive Lower Bounds
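❖ Define $s(x, y) = 1^E x_1 1^{E/d} x_2 1^{E/d} x_3 \ldots x_{n'/2}\, y_{n'/2} \ldots y_3 1^{E/d} y_2 1^{E/d} y_1 1^E$
❖ Take $x \in X = \{\text{all strings of length } \frac{n'}{2}\}$
❖ Take $y \in Y = \{y \mid \mathrm{HAM}(x, y) = d \text{ or } \mathrm{HAM}(x, y) = d + 1\}$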
SLIDE 53
Open Problems
❖ Can we find the longest $d$-near-palindrome under edit distance?
❖ Longest palindromic subsequence
SLIDE 54