Longest Near-Palindrome under Hamming Distance, Elena Grigorescu (PowerPoint presentation)


slide-1
SLIDE 1

Streaming for Aibohphobes: Longest Near-Palindrome under Hamming Distance

Elena Grigorescu, Purdue University
Erfan Sadeqi Azer, Indiana University
Samson Zhou, Purdue University

slide-2
SLIDE 2

Structure of Talk

❖ Background
❖ 1-Pass Additive Algorithm
❖ 2-Pass Exact Algorithm
❖ Lower Bounds

slide-3
SLIDE 3

Finding Structure in Noisy Data

[Figure: a grid of seemingly random letters that hides structured substrings, including FSTTCS, IITKANPUR, INDIA, PATTERN, STREAMINGALGORITHM, the periodic run PERIODPERIODPERIODPERIODPER, and the palindrome LONGPALINDROMEEMORDNILAPGNOL]

slide-4
SLIDE 4

Palindrome

❖ A string that reads the same forwards and backwards
❖ $S = S^R$
❖ RACECAR
❖ AIBOHPHOBIA

slide-5
SLIDE 5

$d$-Near-Palindrome

❖ A string that “almost” reads the same forwards and backwards
❖ Given a metric $\mathrm{dist}$, a $d$-near-palindrome has $\mathrm{dist}(S, S^R) \le d$.
❖ RACECAR
❖ FACECAR

slide-6
SLIDE 6

Hamming Distance

❖ Given strings $X, Y$ of equal length, the Hamming distance between $X$ and $Y$ is the number of positions $i$ at which $X_i \ne Y_i$.
❖ $S$ = FACECAR
❖ $S^R$ = RACECAF
❖ $\mathrm{HAM}(S, S^R) = 2$
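The definition above can be checked directly on the slide's example. This is a minimal sketch; the function names `hamming` and `is_near_palindrome` are illustrative and do not come from the talk.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def is_near_palindrome(s: str, d: int) -> bool:
    """True if HAM(s, reverse(s)) <= d, the definition used on the slides."""
    return hamming(s, s[::-1]) <= d


print(hamming("FACECAR", "RACECAF"))     # 2, as on the slide
print(is_near_palindrome("RACECAR", 0))  # an exact palindrome
print(is_near_palindrome("FACECAR", 2))  # a 2-near-palindrome
```

Note that under this definition a single "wrong" symbol contributes two mismatches, one at each end of the mirrored pair.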

slide-7
SLIDE 7

Streaming Model

❖ String of length $n$ arrives one symbol at a time
❖ Use $o(n)$ space, ideally $O(\mathrm{polylog}\, n)$

abaacabaccbabbbcbabbccababbccb

slide-8
SLIDE 8

Longest $d$-Near-Palindrome Problem

❖ Given a string $S$ of length $n$, which arrives in a data stream, identify the longest $d$-near-palindrome in space $o(n)$.
❖ Given a string $S$ of length $n$, which arrives in a data stream, find a “long” $d$-near-palindrome in space $o(n)$.

slide-9
SLIDE 9

Applications

TGCTTAAGCGCTTGCAAGCGCTTAAGCA
CAAGCGCTTAAGCA
ACGAATTCGCGAAC

slide-10
SLIDE 10

Related Work (Palindromes in Data Streams)

❖ $O(\log n)$ space to provide a $(1+\varepsilon)$ multiplicative approximation to the length of the longest palindrome (Berenbrink, Ergün, Mallmann-Trenn, Sadeqi Azer ‘14)
❖ $O(\sqrt{n})$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest palindrome (BEMS14)
❖ $O(\sqrt{n})$ space to find the longest palindrome in two passes (BEMS14)
❖ $\Omega\left(\frac{\log n}{\varepsilon \log(1+\varepsilon)}\right)$ space for $(1+\varepsilon)$ multiplicative approximation (Gawrychowski, Merkurev, Shur, Uznanski ’16)
❖ $\Omega\left(\frac{n}{E}\right)$ space for $E$ additive approximation (GMSU16)

slide-11
SLIDE 11

Our Results

❖ $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ space to provide a $(1+\varepsilon)$ multiplicative approximation to the length of the longest $d$-near-palindrome
❖ $O(d \sqrt{n} \log^6 n)$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest $d$-near-palindrome
❖ $O(d^2 \sqrt{n} \log^6 n)$ space to find the longest $d$-near-palindrome in two passes
❖ $\Omega(d \log n)$ space LB for $(1+\varepsilon)$ multiplicative approximation
❖ $\Omega\left(\frac{dn}{E}\right)$ space LB for $E$ additive approximation

slide-12
SLIDE 12

Comparison

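Guarantee | Longest Palindrome | Longest $d$-Near-Palindrome (Here)
--- | --- | ---
$(1+\varepsilon)$ multiplicative | $O(\log^2 n)$ (BEMS14) | $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$
$\sqrt{n}$ additive | $O(\sqrt{n} \log n)$ (BEMS14) | $O(d \sqrt{n} \log^6 n)$
two pass exact | $O(\sqrt{n} \log n)$ (BEMS14) | $O(d^2 \sqrt{n} \log^6 n)$
$(1+\varepsilon)$ multiplicative LB | $\Omega\left(\frac{\log n}{\log(1+\varepsilon)}\right)$ (GMSU16) | $\Omega(d \log n)$
$E$ additive LB | $\Omega\left(\frac{n}{E}\right)$ (GMSU16) | $\Omega\left(\frac{dn}{E}\right)$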

slide-13
SLIDE 13

Structure of Talk

❖ Background
❖ 1-Pass Additive Algorithm
❖ 2-Pass Exact Algorithm
❖ Lower Bounds

slide-14
SLIDE 14

Warm-Up

❖ Suppose we see string $S$, followed by string $T$. How can we determine if $S = T$, with high probability?

slide-15
SLIDE 15

Karp-Rabin Fingerprints

❖ Given base $B$ and a prime $P$, define $\phi(S) = \sum_{i=1}^{n} B^{i} S[i] \bmod P$
❖ If $S = T$, then $\phi(S) = \phi(T)$
❖ If $S \ne T$, then $\phi(S) \ne \phi(T)$ w.h.p. (Schwartz-Zippel)
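The fingerprint defined above can be maintained one symbol at a time, which is what makes it usable in a stream. The following is a minimal sketch: the modulus `P = 2**61 - 1`, the class name `Fingerprint`, and the use of `ord` to map symbols to integers are assumptions made for illustration, not choices stated in the talk.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, picked here only for illustration


class Fingerprint:
    """Streaming Karp-Rabin fingerprint: phi(S) = sum_{i>=1} B^i * S[i] mod P."""

    def __init__(self, base: int):
        self.base = base
        self.power = 1  # B^i for the last symbol processed (B^0 before any symbol)
        self.value = 0

    def push(self, symbol: str) -> None:
        """Consume the next symbol of the stream in O(1) time and space."""
        self.power = self.power * self.base % P
        self.value = (self.value + self.power * ord(symbol)) % P


def fingerprint(s: str, base: int) -> int:
    fp = Fingerprint(base)
    for ch in s:
        fp.push(ch)
    return fp.value


base = random.randrange(2, P - 1)  # random base, drawn once and reused
print(fingerprint("RACECAR", base) == fingerprint("RACECAR", base))  # equal strings always agree
print(fingerprint("RACECAR", base) == fingerprint("FACECAR", base))  # unequal strings agree only with small probability
```

Only `value` and `power` are stored, so the state is two numbers modulo `P` regardless of the stream length.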

slide-16
SLIDE 16

Properties of Karp-Rabin Fingerprints

❖ $\phi(S[1:y]) = \phi(S[1:x]) + B^{x}\,\phi(S[x+1:y])$ (concatenation)
❖ Define $\phi^R(S) = \sum_{i=1}^{n} B^{-i} S[i] \bmod P$ (reversal)
❖ $\phi\left((S[1:x])^R\right) = B^{x+1}\,\phi^R(S[1:x])$
❖ $\phi^R(S[1:y]) = \phi^R(S[1:x]) + B^{-x}\,\phi^R(S[x+1:y])$
❖ Can be computed on the fly
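The identities above can be verified numerically. This is a sketch under assumed parameters: `P`, `B`, and the helper names are illustrative, symbols are mapped to integers with `ord`, and Python's 0-indexed slice `s[x:y]` plays the role of the 1-indexed substring from position x+1 to y.

```python
P = (1 << 61) - 1        # illustrative prime modulus
B = 1_000_003            # illustrative base
B_INV = pow(B, -1, P)    # modular inverse of B (Python 3.8+)


def phi(s: str) -> int:
    """Forward fingerprint: sum_{i>=1} B^i * s[i] mod P."""
    value, power = 0, 1
    for ch in s:
        power = power * B % P
        value = (value + power * ord(ch)) % P
    return value


def phi_rev(s: str) -> int:
    """Reverse fingerprint: sum_{i>=1} B^(-i) * s[i] mod P."""
    value, power = 0, 1
    for ch in s:
        power = power * B_INV % P
        value = (value + power * ord(ch)) % P
    return value


def is_palindrome_by_fingerprint(s: str) -> bool:
    """Compare phi(s) with phi(reverse(s)) = B^(len(s)+1) * phi_rev(s)."""
    return phi(s) == pow(B, len(s) + 1, P) * phi_rev(s) % P


s, x, y = "AIBOHPHOBIA", 4, 9
# Concatenation: the fingerprint of a prefix extends to a longer prefix.
print(phi(s[:y]) == (phi(s[:x]) + pow(B, x, P) * phi(s[x:y])) % P)
# Reversal: the fingerprint of the reversed prefix comes from phi_rev.
print(phi(s[:x][::-1]) == pow(B, x + 1, P) * phi_rev(s[:x]) % P)
print(is_palindrome_by_fingerprint("RACECAR"))
```

Because both `phi` and `phi_rev` are updated symbol by symbol, a palindrome test on any prefix needs no access to earlier symbols.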

slide-17
SLIDE 17

Identifying Palindromes

❖ 111101011100001010010101001111101011100001010010101001

slide-18
SLIDE 18

Identifying Near-Palindromes?

❖ 111101011100001010010101001111101011100001010010101001

slide-23
SLIDE 23

Identifying Near-Palindromes? (CFP+16)

slide-24
SLIDE 24

Karp-Rabin Fingerprints for Subpatterns

❖ $S_{a,b} = S[a]\, S[a+b]\, S[a+2b]\, S[a+3b] \ldots$
❖ $\phi_{a,b}(S) = \phi(S_{a,b}) = B \cdot S[a] + B^2 \cdot S[a+b] + B^3 \cdot S[a+2b] + \ldots$

slide-25
SLIDE 25

Identifying Near-Palindromes?

❖ Let $\Delta = \#\{\, a \mid \phi_{a,b}(S) \ne B^{k} \phi^{R}_{a,b}(S) \,\}$
❖ Then $\Delta \le \mathrm{HAM}(S, S^R)$
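The bound on $\Delta$ can be illustrated with a small offline sketch. Two simplifications are assumed here and are not from the talk: the string is stored in full, and each subpattern of the string is compared against the same subpattern of its explicit reverse, rather than through the $B^{k}\phi^{R}$ form on the slide. Residues are 0-indexed, and `P`, `B`, and the function names are illustrative.

```python
P = (1 << 61) - 1   # illustrative prime modulus
B = 1_000_003       # illustrative base


def phi(symbols: str) -> int:
    """Karp-Rabin fingerprint sum_{i>=1} B^i * symbols[i] mod P."""
    value, power = 0, 1
    for ch in symbols:
        power = power * B % P
        value = (value + power * ord(ch)) % P
    return value


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def delta(s: str, b: int) -> int:
    """Count residues a in 0..b-1 whose subpattern fingerprints differ
    between s and its reverse."""
    r = s[::-1]
    return sum(phi(s[a::b]) != phi(r[a::b]) for a in range(b))


s = "FACECAR"
for b in (3, 5):
    print(b, delta(s, b), hamming(s, s[::-1]))
```

Every residue class that is counted contains at least one mismatched position, and the classes are disjoint, which is why the count never exceeds the Hamming distance. With `b = 3` the two mismatches of FACECAR (positions 0 and 6) share a class, so the count undershoots.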

slide-26
SLIDE 26

Identifying Near-Palindromes?

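❖ Sample $\log n$ primes $p_1, p_2, \ldots$ from $[16\, d \log^2 n,\ 544\, d \log^2 n]$.
❖ Let $\Delta = \max_i \#\{\, a \mid \phi_{a,p_i}(S) \ne B^{k} \phi^{R}_{a,p_i}(S) \,\}$
❖ $\Delta \le \mathrm{HAM}(S, S^R)$
❖ If $\mathrm{HAM}(S, S^R) > 2d$, then $\Delta > \left(1 + \frac{1}{16}\right) d$ w.h.p. (CFP+16)

What about $\mathrm{HAM}(S, S^R) \le 2d$?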

slide-27
SLIDE 27

Karp-Rabin Fingerprints for Sub-Subpatterns

slide-28
SLIDE 28

Second-Level Karp-Rabin Fingerprints

❖ Call a mismatch isolated under $p_i$ if it is the only mismatch under some subpattern $S_{a,p_i}$. Let $I$ be the number of isolated mismatches.
❖ If $\mathrm{HAM}(S, S^R) \le 2d$, then $I = \mathrm{HAM}(S, S^R)$ w.h.p. (CFP+16)
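The notion of an isolated mismatch can be made concrete with a short sketch. It works directly from the mismatch positions of a stored string, which a streaming algorithm would not have; the second-level fingerprints that recover those positions from a sketch are not reproduced here. The function names and the 0-indexed residues are assumptions for illustration.

```python
def mismatch_positions(s: str) -> list[int]:
    """Positions i at which s differs from its reverse."""
    r = s[::-1]
    return [i for i in range(len(s)) if s[i] != r[i]]


def isolated_mismatches(s: str, primes: list[int]) -> set[int]:
    """Mismatches that are alone in their residue class for at least one prime."""
    positions = mismatch_positions(s)
    isolated = set()
    for p in primes:
        by_residue: dict[int, list[int]] = {}
        for i in positions:
            by_residue.setdefault(i % p, []).append(i)
        for group in by_residue.values():
            if len(group) == 1:  # the only mismatch in subpattern S_{a,p}
                isolated.add(group[0])
    return isolated


s = "FACECAR"                               # mismatches at positions 0 and 6
print(sorted(isolated_mismatches(s, [3])))     # both fall in residue class 0 mod 3
print(sorted(isolated_mismatches(s, [3, 5])))  # 5 separates them
```

With several primes, every mismatch is likely to be isolated under at least one of them, which is the situation the second bullet describes.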

slide-29
SLIDE 29

In Review

❖ There exists a data structure of size $O(d \log^6 n)$ bits that recognizes whether $\mathrm{HAM}(S, S^R) \le d$ w.h.p.
❖ Recently, this has been improved to $O(d \log n)$. (Clifford, Kociumaka, Porat ‘17)
❖ Through black-box reduction, improves our results by $O(\log^5 n)$.

slide-30
SLIDE 30

Additive Error Algorithm

❖ Initialize a data structure every $\frac{\sqrt{n}}{2}$ positions!

slide-31
SLIDE 31

Additive Error Algorithm

❖ $2\sqrt{n}$ sketches, each of size $O(d \log^6 n)$ bits
❖ Total space: $O(d \sqrt{n} \log^6 n)$ bits
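The checkpoint idea can be simulated offline to see where the additive error comes from. This sketch is not space efficient: it stores the whole string and replaces each $O(d \log^6 n)$-bit sketch with a direct Hamming computation, so it only illustrates which candidates the algorithm considers. The function names and the rounding of the checkpoint spacing are assumptions.

```python
from math import isqrt


def hamming_to_reverse(s: str) -> int:
    return sum(a != b for a, b in zip(s, s[::-1]))


def longest_exact(s: str, d: int) -> int:
    """Brute force over all substrings: length of the longest d-near-palindrome."""
    n, best = len(s), 0
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j - i > best and hamming_to_reverse(s[i:j]) <= d:
                best = j - i
    return best


def longest_checkpointed(s: str, d: int) -> tuple[int, int]:
    """Only substrings that start at a checkpoint are considered.
    Returns (best length found, checkpoint spacing)."""
    n = len(s)
    step = max(1, isqrt(n) // 2)       # roughly sqrt(n)/2
    best = 0
    for i in range(0, n, step):        # one "sketch" per checkpoint
        for j in range(i + 1, n + 1):  # every stream position is a candidate end
            if j - i > best and hamming_to_reverse(s[i:j]) <= d:
                best = j - i
    return best, step


s = "abaacabaccbabbbcbabbccababbccb"   # stream shown on the Streaming Model slide
exact = longest_exact(s, 2)
approx, step = longest_checkpointed(s, 2)
print(exact, approx, step)
```

Trimming a $d$-near-palindrome by the same amount at both ends leaves a $d$-near-palindrome, so moving the start forward to the next checkpoint costs fewer than `2 * step` symbols, which is at most $\sqrt{n}$.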

slide-32
SLIDE 32

Structure of Talk

❖ Background
❖ 1-Pass Additive Algorithm
❖ 2-Pass Exact Algorithm
❖ Lower Bounds

slide-33
SLIDE 33

2-Pass Exact Algorithm

❖ Can we modify the 1-pass additive algorithm into a 2-pass exact algorithm?
❖ Missing characters before checkpoint!

slide-34
SLIDE 34

2-Pass Exact Algorithm

❖ Idea: keep all characters before each checkpoint in the second pass
❖ What if there are $\Omega(n)$ candidates?
❖ Structural result of palindromes (BEMS14)

slide-35
SLIDE 35

Structural Result of Near-Palindromes

❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes
❖ Goal #2: Use sublinear space in compression

slide-42
SLIDE 42

Structural Result of Near-Palindromes

❖ Not quite periodic (at most $2d - 1$ different words)
❖ Need to save at most $2d - 1$ fingerprints of words

slide-43
SLIDE 43

2-Pass Exact Algorithm

❖ First pass: $O(d^2 \sqrt{n} \log^6 n)$ bits
❖ At most $2d - 1$ fingerprints, each of size $O(d \log^6 n)$ words
❖ Need to save $\sqrt{n}$ characters before $2d - 1$ checkpoints: $O(d \sqrt{n})$
❖ Total space: $O(d^2 \sqrt{n} \log^6 n)$ bits

slide-44
SLIDE 44

Structure of Talk

❖ Background
❖ 1-Pass Additive Algorithm
❖ 2-Pass Exact Algorithm
❖ Lower Bounds

slide-45
SLIDE 45

Multiplicative Lower Bounds

❖ Yao’s Principle: find “hard” distribution for deterministic algorithms
❖ Let $\nu$ be the prefix of $10110011100011110000\ldots = 1^1 0^1 1^2 0^2 \ldots$ of length $\frac{n}{4}$ (GMSU16).
❖ Take $x \in X = \{\text{strings of length } \frac{n}{4} \text{ with weight } d\}$
❖ Take $y \in Y = \{\, y \mid \mathrm{HAM}(x, y) = d \text{ or } \mathrm{HAM}(x, y) = d + 1 \,\}$
❖ Define $s(x, y) = \nu^R\, x\, y^R\, \nu$.
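The construction of $s(x, y)$ can be sketched as follows; the function names are illustrative. The example inputs are the strings $x$ and $y$ that appear later in the talk. The sketch reports the number of mismatched mirrored pairs of $s(x, y)$, which equals $\mathrm{HAM}(x, y)$; the Hamming distance between $s(x, y)$ and its full reverse counts every such pair twice.

```python
def nu_prefix(m: int) -> str:
    """First m symbols of 1^1 0^1 1^2 0^2 1^3 0^3 ..."""
    out: list[str] = []
    k = 1
    while len(out) < m:
        out.extend("1" * k)
        out.extend("0" * k)
        k += 1
    return "".join(out[:m])


def build_instance(x: str, y: str) -> str:
    """s(x, y) = reverse(nu) + x + reverse(y) + nu, with |nu| = |x| = |y|."""
    nu = nu_prefix(len(x))
    return nu[::-1] + x + y[::-1] + nu


def hamming(a: str, b: str) -> int:
    return sum(c != e for c, e in zip(a, b))


def mismatched_mirror_pairs(s: str) -> int:
    """Number of pairs (i, n-1-i) with i < n/2 whose symbols differ."""
    n = len(s)
    return sum(s[i] != s[n - 1 - i] for i in range(n // 2))


x = "101100000010001000000100100000"
y = "111101100010001011100100100010"
s = build_instance(x, y)
print(len(s), hamming(x, y), mismatched_mirror_pairs(s))
```

The two copies of $\nu$ mirror each other exactly, so all mismatches of the whole string come from comparing $x$ against $y$ position by position.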

slide-46
SLIDE 46

Multiplicative Lower Bounds

YES: If $\mathrm{HAM}(x, y) \le d$, then the longest $d$-near-palindrome of $s(x, y)$ has length $n$.
NO: If $\mathrm{HAM}(x, y) > d$, then the longest $d$-near-palindrome of $s(x, y)$ has length at most $200 d^2 + \frac{n}{2}$.

slide-47
SLIDE 47

Multiplicative Lower Bounds

❖ A $(1+\varepsilon)$ multiplicative algorithm differentiates whether $\mathrm{HAM}(x, y) \le d$ or $\mathrm{HAM}(x, y) > d$.
❖ Just need to show that an algorithm cannot differentiate whether $\mathrm{HAM}(x, y) \le d$ or $\mathrm{HAM}(x, y) > d$ in $o(d \log n)$ space!

slide-48
SLIDE 48

Multiplicative Lower Bounds

❖ Save $x$ in $\frac{d \log n}{3}$ bits.
❖ Since $x \in X = \{\text{strings of length } \frac{n}{4} \text{ with weight } d\}$, there are $\frac{|X|}{4}$ pairs $(x, x')$ which are mapped to the same configuration.

slide-49
SLIDE 49

Multiplicative Lower Bounds

❖ Let $I$ be the set of indices for which $x_i = 1$ or $x'_i = 1$
❖ Suppose $\mathrm{HAM}(x, y) = d$ but $y$ does not differ from $x$ in $I$
❖ $x$: 101100000010001000000100100000
❖ $x'$: 100000010010101000000100100000
❖ $y$: 111101100010001011100100100010
❖ Then $\mathrm{HAM}(x', y) > d$!
❖ Errs on either $s(x, y)$ or $s(x', y)$.
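The example strings on this slide can be checked mechanically. The sketch below uses only the three strings shown; the variable names are illustrative, and $d$ is read off as the weight of $x$.

```python
x       = "101100000010001000000100100000"
x_prime = "100000010010101000000100100000"
y       = "111101100010001011100100100010"


def hamming(a: str, b: str) -> int:
    return sum(c != e for c, e in zip(a, b))


d = x.count("1")  # both x and x' have weight d
# I: indices where x or x' has a 1.
I = {i for i in range(len(x)) if x[i] == "1" or x_prime[i] == "1"}
# y agrees with x on every index of I.
agrees_on_I = all(x[i] == y[i] for i in I)

print(d, x_prime.count("1"))
print(agrees_on_I)
print(hamming(x, y), hamming(x_prime, y))
```

Since $y$ differs from $x$ only outside $I$, every position where $x'$ differs from $x$ adds a fresh mismatch against $y$, pushing $\mathrm{HAM}(x', y)$ above $d$ while $\mathrm{HAM}(x, y)$ stays at $d$.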

slide-50
SLIDE 50

Multiplicative Lower Bounds

❖ There are $\frac{|X|}{4}$ values of $x$ mapped to the wrong configuration, each with $\binom{\frac{n}{4} - 2d}{d}$ values of $y$, where algorithm is incorrect.
❖ Probability of failure: $\dfrac{\frac{|X|}{4} \binom{\frac{n}{4} - 2d}{d}}{|X|\,|Y|} \ge \dfrac{1}{n}$

slide-51
SLIDE 51

In Review

❖ Provided a distribution over which any deterministic algorithm with $o(d \log n)$ bits fails to distinguish $\mathrm{HAM}(x, y) \le d$ or $\mathrm{HAM}(x, y) > d$ at least $\frac{1}{n}$ of the time
❖ A $(1+\varepsilon)$ multiplicative algorithm differentiates whether $\mathrm{HAM}(x, y) \le d$ or $\mathrm{HAM}(x, y) > d$
❖ Showed every deterministic algorithm fails over random inputs

slide-52
SLIDE 52

Additive Lower Bounds

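❖ Define $s(x, y) = 1^{E}\, x_1\, 1^{E/d}\, x_2\, 1^{E/d}\, x_3 \ldots x_{n'/2}\; y_{n'/2} \ldots y_3\, 1^{E/d}\, y_2\, 1^{E/d}\, y_1\, 1^{E}$
❖ Take $x \in X = \{\text{all strings of length } \frac{n'}{2}\}$
❖ Take $y \in Y = \{\, y \mid \mathrm{HAM}(x, y) = d \text{ or } \mathrm{HAM}(x, y) = d + 1 \,\}$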

slide-53
SLIDE 53

Open Problems

❖ Can we find the longest $d$-near-palindrome in the edit distance?
❖ Longest palindromic subsequence

slide-55
SLIDE 55

Questions?
